DEPENDENT VARIABLE
In this section, we investigate the dependent variable prior to any
filtering.
Our dependent variable is the number of permits issued per census
tract. The unit of analysis is permits per census tract because of
census data availability and the fact that federal and state funding
often flows to areas via census tract or zip code eligibility. We time
and or space to see if there there are any patterns to missing permit
data so as to better determine which years to test and train on.
# See if there are any patterns to missing spatial data
res_permits %>%
group_by(year = year(as.Date(permitissuedate))) %>%
summarise_all(funs(sum(is.na(.)))) %>%
datatable(options = list(pageLength = 12),
style = "auto")
When looking at council district vs. year distribution of permit
data, there are plenty of permits in each district, but wide differences
by year. Permitting activity in earlier years is much lower than 2020 to
2023. Although 2020 is typically considered an anomaly, permit issuance
and development was strong due to low interest rates. Therefore, we
trained 2020 and 2021 permit issuance data to test on 2022 development
activity. As a result, we got rid of missing spatial data
res_permits %>%
group_by(council_district) %>%
summarise(count = n()) %>%
ggplot(aes(y = as.factor(council_district), x = count)) +
geom_bar(stat = "identity", fill = "#ffb600") +
plotTheme() +
labs(title = "Permits by Council District", caption = "Figure 1")

res_permits %>%
group_by(year = year(as.Date(permitissuedate))) %>%
st_drop_geometry() %>%
summarise(count = n()) %>%
ggplot(aes (x = year, y = count)) +
geom_bar(stat = "identity", fill = "#ffb600") +
plotTheme() +
labs(title = "Permits by Year", caption = "Figure 2")

res_permits <- res_permits %>%
filter(!is.na(lng) & !is.na(lat)) %>%
st_as_sf(coords = c("lng","lat"),crs=4326)
Next, we examined permit types and counts to understand the
distribution of permit types specific to Philadelphia. When looking at
the count of permit types, as represented below by n, we can see that
the permit type alone does not indicate what type of work is being done.
New construction will necessitate different permit types, such as
plumbing, mechanical, or electrical permits, but these can also be
counted in regular maintenance. Therefore, filtering only based on new
construction may be misleading. We looked at census data next to begin
understanding how permit data is associated with demographics and our
gentrification indicator.
res_permits %>%
group_by(permitdescription, typeofwork) %>%
st_drop_geometry() %>%
summarise(n = n()) %>%
arrange(permitdescription, desc(n)) %>%
datatable(options = list(pageLength = 10),
style = "auto")
phl_tracts4326 <- tracts(state = "PA",
county = "Philadelphia",
year = "2020") %>%
select(GEOID, geometry) %>%
st_set_crs(4326)
res_permits2 <- res_permits %>%
group_by(permitdescription, typeofwork) %>%
mutate(n = n()) %>%
ungroup() %>%
select(permittype, permitdescription, typeofwork, geometry, permitissuedate) %>%
#join to census tract
st_join(phl_tracts4326)
When we look at number of permits by type per tract as seen below, it
becomes apparent there are a very high concentration of all permits in
the Fishtown, Port Richmond area. When we look by year, we see that some
of this is because of data unavailability in earlier years followed by a
flux of permit issuance from 2020 to 2022.
So what happened these years?
Interest rates were at a historical low in 2020 to 2022, in part
to entice economic activity to be as enticing as possible after the
COVID-19 pandemic brought the economy to a screeching halt. 
More locally, Philadelphia passed a tax policy in 2000 to provide
a tax abatement on new housing construction until it
expired in 2022. The policy goal was to spur development for
residential property owners by providing a 10-year tax abatement on the
value of improvements. This partially explains the time lag of some
development, but it remains unclear why the development was so
concentrated spatially.
Our next data analysis step is to drop the year and look only at
permits in and after 2018. It is likely that more permits are issued in
areas where these changes are already happening.
# PICK 1
tract_permits_year <- res_permits2 %>%
st_drop_geometry() %>%
mutate(permit = paste(permitdescription, "-", typeofwork),
year = year(as.Date(permitissuedate))) %>%
select(-permitissuedate, -permitdescription, -permittype, -typeofwork) %>%
gather(Variable, value, -GEOID, -year) %>%
dplyr::count(Variable, value, GEOID, year) %>%
group_by(GEOID, year) %>%
mutate(TOTAL = sum(n)) %>%
ungroup() %>%
spread(key = value, value = n) %>%
select(-Variable) %>%
replace(is.na(.), 0)
#total permits
tract_permits_year %>%
left_join(phl_tracts4326) %>%
st_sf() %>%
filter(year > 2014 & year < 2023) %>%
ggplot()+
geom_sf(aes(fill = cut(TOTAL, breaks = 8))) +
scale_fill_manual(values = palette1) +
mapTheme() +
labs(title = "Permits between 2014 - 2023", caption = "Figure 3")

#by year - there are a lot of tracts missing in earlier years
# tract_permits_year %>%
# left_join(phl_tracts4326) %>%
# st_sf() %>%
# filter(year > 2014 & year <= 2023) %>%
# ggplot()+
# geom_sf(aes(fill = TOTAL)) +
# scale_fill_binned(n.breaks=8) +
# facet_wrap(~year) +
# mapTheme()
tract_permits <- res_permits2 %>%
st_drop_geometry() %>%
mutate(permit = paste(permitdescription, "-", typeofwork),
year = year(as.Date(permitissuedate))) %>%
filter(year >= 2018) %>%
select(-permitissuedate, -permitdescription, -permittype, -typeofwork, -year) %>%
gather(Variable, value, -GEOID) %>%
dplyr::count(Variable, value, GEOID) %>%
group_by(GEOID) %>%
mutate(TOTAL = sum(n)) %>%
ungroup() %>%
spread(key = value, value = n) %>%
select(-Variable) %>%
replace(is.na(.), 0)
PERMIT SELECTION
Census Data
But a key consideration is: does
demand lead supply or vice versa? Which comes first, development or
gentrification?
There are two theories of neighborhood change as it relates to this.
- 1) Housing demand of middle-class, mostly white residents encourages
them to leave neighborhoods to seek out more affordable neighborhoods. -
2) Development supply of new construction in historically poorer
neighborhoods encourages an influx of new residents, who are encouraged
to move to a neighborhood by the supply of housing.
The above study found that housing market growth (i.e. new permit
issuance) did not predict future increase in gentrification; rather,
development followed neighborhood change. Gentrification features
themselves could be considered indicators of future development.
As a result of permit analysis, we selected permits that were
associated with gentrification indicators. Using prior
research on gentrification indicators and using publicly available
census data, we examined the following demographic characteristics:
- All Census tracts in Philadelphia, 5Y ACS for 2021 and for 2016
- Median HH income, CPI-inflation adjusted
- Median home value, CPI-inflation adjusted
- Median rent, CPI-inflation adjusted
- Median year structure built
- Tenure - owner-occupied
- Number of housing units
- Number of vacancies
- Count of second mortgages
- Percent white population
- Percent with Bachelor’s degree
We calculated: rent to income ratio, net changes of multiple
indicators from 2017 to 2022, and percent changes from 2017 to 2022.
Median rents, values, and incomes were adjusted to 2022 dollars.
CPI2017 <- 242.839
CPI2022 <- 281.148
CPI <- CPI2022/CPI2017
# NOTE: currently replacing NA values with a very small value...
phlCensus22 <-
get_acs(geography = "tract",
variables = c("B01003_001", "B19013_001",
"B02001_002", "B25077_001",
"B25064_001", "B25035_001",
"B25003_001", "B25003_002",
"B25001_001", "B25002_001",
"B25002_003", "B25081_001",
"B25081_004", "B15003_001",
"B15003_022"
),
year = 2022,
dataset = "acs5",
state = "PA",
geometry = FALSE,
county = c("Philadelphia"),
output = "wide") %>%
rename(Total_Pop = B01003_001E,
Med_Inc = B19013_001E,
White_Pop = B02001_002E,
Med_Value = B25077_001E,
Med_Rent = B25064_001E,
Med_Structure = B25035_001E,
Owner_univ = B25003_001E,
Owners = B25003_002E,
Tot_Units = B25001_001E,
Vac_univ = B25002_001E,
Vacants = B25002_003E,
Mort_status_univ = B25081_001E,
Sec_Mort = B25081_004E,
Bach_univ = B15003_001E,
Bach = B15003_022E
) %>%
dplyr::select(Total_Pop, Med_Inc, White_Pop,
Med_Value, Med_Rent, Med_Structure,
Owner_univ, Owners, Tot_Units,
Vac_univ, Vacants, Mort_status_univ,
Sec_Mort, Bach_univ, Bach, GEOID) %>%
mutate(Percent_White = White_Pop / Total_Pop,
Percent_Owner = Owners / Owner_univ,
Percent_Vacant = Vacants / Vac_univ,
Percent_2mort = Sec_Mort / Mort_status_univ,
Percent_bach = Bach / Bach_univ,
Rent_Income_Ratio = Med_Rent / (Med_Inc/12)
) %>%
#replace(is.na(.), .00000001) %>%
select(-ends_with("_univ"))
phlCensus17 <-
get_acs(geography = "tract",
variables = c("B01003_001", "B19013_001",
"B02001_002", "B25077_001",
"B25064_001", "B25035_001",
"B25003_001", "B25003_002",
"B25001_001", "B25002_001",
"B25002_003", "B25081_001",
"B25081_004", "B15003_001",
"B15003_022"
),
year = 2017,
state = "PA",
geometry = FALSE,
county=c("Philadelphia"),
output = "wide") %>%
rename(Total_Pop = B01003_001E,
Med_Inc = B19013_001E,
White_Pop = B02001_002E,
Med_Value = B25077_001E,
Med_Rent = B25064_001E,
Med_Structure = B25035_001E,
Owner_univ = B25003_001E,
Owners = B25003_002E,
Tot_Units = B25001_001E,
Vac_univ = B25002_001E,
Vacants = B25002_003E,
Mort_status_univ = B25081_001E,
Sec_Mort = B25081_004E,
Bach_univ = B15003_001E,
Bach = B15003_022E
) %>%
dplyr::select(Total_Pop, Med_Inc, White_Pop,
Med_Value, Med_Rent, Med_Structure,
Owner_univ, Owners, Tot_Units,
Vac_univ, Vacants, Mort_status_univ,
Sec_Mort, Bach_univ, Bach, GEOID) %>%
mutate(
Med_Rent = Med_Rent * CPI,
Med_Value = Med_Value * CPI,
Percent_White = White_Pop / Total_Pop,
Percent_Owner = Owners / Owner_univ,
Percent_Vacant = Vacants / Vac_univ,
Percent_2mort = Sec_Mort / Mort_status_univ,
Percent_bach = Bach / Bach_univ,
Rent_Income_Ratio = Med_Rent / (Med_Inc/12),
) %>%
#replace(is.na(.), .00000001) %>%
select(-ends_with("_univ"))
# Interpolate based on population in order to translate 2017 tracts to 2020 tracts
# workaround because tidycensus wasn't downloading geometry
#get just population from phlCensus
pop_22 <- phlCensus22 %>%
select(GEOID, Total_Pop)
#get tracts; join population to tracts
phl_tracts22 <- tracts(state = "PA",
county = "Philadelphia",
year = "2022") %>%
left_join(pop_22) %>%
st_sf()
#join tracts to Census data
phlCensus22 <- phlCensus22 %>%
left_join(phl_tracts22 %>%
select(GEOID, geometry)) %>%
st_sf()
#get 2017 tracts
phl_tracts17 <- tracts(state = "PA",
county = "Philadelphia",
year = "2017")
#join tracts to 2017 Census data
phlCensus17 <- phlCensus17 %>%
left_join(phl_tracts17 %>%
select(GEOID,geometry)) %>%
st_sf()
sf_use_s2(FALSE)
census17_interpolate_pw <- interpolate_pw(
phlCensus17,
phlCensus22,
to_id = "GEOID",
extensive = TRUE,
weights = phl_tracts22,
weight_column = "Total_Pop",
crs = 4269)
census_22_and_comparison <- phlCensus22 %>%
left_join(st_drop_geometry(census17_interpolate_pw),
by = "GEOID",
suffix = c("_2022", "_2017")) %>%
mutate(Total_Pop_NET = Total_Pop_2022 - Total_Pop_2017,
Med_Inc_NET = Med_Inc_2022 - Med_Inc_2017,
White_Pop_NET = White_Pop_2022 - White_Pop_2017,
Med_Value_NET = Med_Value_2022 - Med_Value_2017,
Med_Rent_NET = Med_Rent_2022 - Med_Rent_2017,
Owners_NET = Owners_2022 - Owners_2017,
Tot_Units_NET = Tot_Units_2022 - Tot_Units_2017,
Vacants_NET = Vacants_2022 - Vacants_2017,
Sec_Mort_NET = Sec_Mort_2022 - Sec_Mort_2017,
Percent_White_NET = Percent_White_2022 - Percent_White_2017,
Percent_Owner_NET = Percent_Owner_2022 - Percent_Owner_2017,
Percent_Vacant_NET = Percent_Vacant_2022 - Percent_Vacant_2017,
Percent_2mort_NET = Percent_2mort_2022 - Percent_2mort_2017,
Rent_Income_Ratio_NET = Rent_Income_Ratio_2022 - Rent_Income_Ratio_2017,
Bach_NET = Bach_2022 - Bach_2017,
Percent_bach_NET = Percent_bach_2022 - Percent_bach_2017,
Total_Pop_PCT = ifelse(Total_Pop_2017>0, (Total_Pop_NET / Total_Pop_2017), NA)*100,
Med_Inc_PCT = ifelse(Med_Inc_2017 >0, (Med_Inc_NET / Med_Inc_2017), NA)*100,
White_Pop_PCT = ifelse(White_Pop_2017 >0, (White_Pop_NET / White_Pop_2017), NA)*100,
Med_Value_PCT = ifelse(Med_Value_2017>0, (Med_Value_NET / Med_Value_2017), NA)*100,
Med_Rent_PCT = ifelse(Med_Rent_2017>0, (Med_Rent_NET / Med_Rent_2017),NA)*100,
Owners_PCT = ifelse(Owners_2017>0, ( Owners_NET / Owners_2017),NA)*100,
Tot_Units_PCT = ifelse(Tot_Units_2017>0, (Tot_Units_NET / Tot_Units_2017),NA)*100,
Vacants_PCT = ifelse(Vacants_2017>0,(Vacants_NET / Vacants_2017),NA)*100,
Sec_Mort_PCT = ifelse(Sec_Mort_2017>0,(Sec_Mort_NET / Sec_Mort_2017),NA)*100,
Percent_White_PCT =ifelse(Percent_White_2017>0,(Percent_White_NET / Percent_White_2017),NA)*100,
# is it bad to calculate percent change using percentages?
Percent_Owner_PCT = ifelse(Percent_Owner_2017>0,(Percent_Owner_NET / Percent_Owner_2017),NA)*100,
Percent_Vacant_PCT = ifelse(Percent_Vacant_2017>0,(Percent_Vacant_NET / Percent_Vacant_2017),NA)*100,
Percent_2mort_PCT = ifelse(Percent_2mort_2017>0,(Percent_2mort_NET / Percent_2mort_2017),NA)*100,
Rent_Income_Ratio_PCT = ifelse(Rent_Income_Ratio_2017>0, (Rent_Income_Ratio_NET / Rent_Income_Ratio_2017),NA)*100,
Percent_bach_PCT = ifelse(Percent_bach_2017>0, (Percent_bach_NET / Percent_bach_2017),NA)*100) %>%
relocate(GEOID)
Gentrification Indicator
We developed a gentrification indicator methodology based on a
Drexel study. This is just one possible indicator and could be
improved upon or changed by incorporating additional demographic
variables.
These four gentrification categories we use to characterize tracts
are:
- Ineligible
- Not gentrifying
- Gentrifying
- Intensely Gentrifying
We can then see for which characterizations there are differences in
permit types. If the mean number of permits for census tracts that are
gentrifying or intensely gentrifying is larger than for tracts that are
ineligible or not gentrifying, we keep that permit type.
indicator <- census_22_and_comparison %>%
mutate(indicator =
case_when(
# tracts with < 50 people or in top quartile of median income are excluded
Total_Pop_2022 <= 50 | Med_Inc_2017 > quantile(Med_Inc_2017, .75, na.rm = TRUE) ~ "Ineligible",
# tracts with below median increase in pct with bachelor's degree,
# below median changes in rent and home values are not gentrifying
Percent_bach_PCT < median(Percent_bach_NET, na.rm = TRUE) |
(Med_Value_NET < median(Med_Value_NET, na.rm = TRUE) &
Med_Rent_NET < median(Med_Rent_NET, na.rm = TRUE)) ~ "Not gentrifying",
# tracts with above-median percent increases in bachelor degree attainment
# and EITHER b/t .5 and .75 percentile increase in home values or rents = gentrifying
Percent_bach_PCT >= median(Percent_bach_NET, na.rm = TRUE) & (
(Med_Value_NET >= quantile(Med_Value_NET, .50, na.rm = TRUE) &
Med_Value_NET < quantile(Med_Value_NET, .75, na.rm = TRUE)) |
(Med_Rent_NET >= quantile(Med_Rent_NET, .50, na.rm = TRUE) &
Med_Rent_NET < quantile(Med_Rent_NET, .75, na.rm = TRUE))
) ~ "Gentrifying",
# tracts with above-median percent increases in bachelor degree attainment
# and EITHER in .75 percentile increase in home values or rents = intensely gentrifying
Percent_bach_PCT >= median(Percent_bach_NET, na.rm = TRUE) &
(Med_Value_NET >= quantile(Med_Value_NET,.75, na.rm = TRUE) |
Med_Rent_NET >= quantile(Med_Rent_NET,.75, na.rm = TRUE)
)
~ "Intensely gentrifying")
) %>%
select(GEOID, indicator, Total_Pop_2022, Total_Pop_2017,
Med_Inc_2022, Med_Inc_2017,
Percent_bach_NET, Med_Value_NET, Med_Rent_NET)
Permit Selection
We can see below how each permit type relates to each gentrification
indicator, and we can also calculate the net difference and percent
difference between the number of permits in not gentrifying tracts
versus intensely gentrifying tracts.
indicator_table <- indicator %>%
dplyr::select(-Total_Pop_2022, -Total_Pop_2017,
-Med_Inc_2022, -Med_Inc_2017,
-Percent_bach_NET, -Med_Value_NET, -Med_Rent_NET) %>%
left_join(tract_permits) %>%
st_drop_geometry() %>%
select(-GEOID) %>%
gather(Variable, value, -indicator) %>%
group_by(indicator, Variable) %>%
summarise_if(is.numeric, mean, na.rm = TRUE) %>%
ungroup() %>%
spread(key = indicator, value = value) %>%
mutate_if(is.numeric, round, digits = 3) %>%
select(Variable, Ineligible, `Not gentrifying`, Gentrifying, `Intensely gentrifying`)
keep <- indicator_table %>%
mutate(g_to_non_g_PCT = round((Gentrifying - `Not gentrifying`) / `Not gentrifying`, 4) *100,
ig_to_non_g_PCT = round((`Intensely gentrifying` - `Not gentrifying`) / `Not gentrifying`, 4) *100) %>%
filter(Gentrifying >= 1 & `Intensely gentrifying`>=1 & g_to_non_g_PCT > 0 & ig_to_non_g_PCT > 0) %>%
select(-g_to_non_g_PCT, -ig_to_non_g_PCT)
keep %>%
datatable()
keep <- unique(keep$Variable)
Here, we filter only for permits where there is a non-zero percent
difference between intensely gentrifying tracts and non-gentrifying
tracts, and keep only those permits and attach that to census data, so
each tract is assigned an indicator for whether or not it is
gentrifying.
tract_permits_final <- res_permits2 %>%
st_drop_geometry() %>%
mutate(permit = paste(permitdescription, "-", typeofwork),
year = year(as.Date(permitissuedate))) %>%
filter(year >= 2018 & permit %in% keep) %>%
select(-permitissuedate, -permitdescription, -permittype, -typeofwork) %>%
gather(Variable, value, -GEOID, -year) %>%
dplyr::count(Variable, value, GEOID, year) %>%
group_by(GEOID, year) %>%
mutate(TOTAL = sum(n)) %>%
ungroup() %>%
spread(key = value, value = n) %>%
select(GEOID, year, TOTAL) %>%
replace(is.na(.), 0)
census_permits <-
expand.grid(GEOID = unique(tract_permits_final$GEOID),
year=unique(tract_permits_final$year)) %>%
left_join(indicator %>% select(GEOID, indicator)) %>%
left_join(tract_permits_final) %>%
left_join(census_22_and_comparison %>% st_drop_geometry()) %>%
mutate(TOTAL = ifelse(is.na(TOTAL), 0, TOTAL)) %>%
filter(Total_Pop_2022 > 0) %>%
select(-geometry) %>%
left_join(phl_tracts4326) %>%
st_as_sf()
census_permits$indicator <- factor(census_permits$indicator, ordered = TRUE, levels = c("Ineligible", "Not gentrifying", "Gentrifying", "Intensely gentrifying"))
Exploratory Analysis of Filtered Dependent Variable
Now that we have filtered what will ultimately become our dependent
variable, we can see that there are many census tracts where
there are no permits issued across all years.
ggplot(census_permits) +
geom_histogram(aes(x = TOTAL), bins = 40, fill = "#ffb600") +
labs(title = "Permit Distribution", caption = "Figure 4")

There was a huge spike of permits issued in 2020 citywide, likely
spurred by the announced rollback
of the tax abatement period and the historically low interest rates.
As a result, 2021
was a major permit boom year and December
2021 experienced the the highest count of new construction permits
issued since January 2016. The boom
continued in 2022 and news articles throughout the years commented
on the “absurd”
number of new consruction permits issued in Philadelphia.
census_permits %>%
group_by(year) %>%
summarise(TOTAL = sum(TOTAL)) %>%
filter(!is.na(year)) %>%
st_drop_geometry() %>%
ggplot(aes (x = year, y = TOTAL)) +
geom_bar(stat = "identity", fill = "#ffb600") +
plotTheme() +
labs(title = "Permits by Year", caption = "Figure 5")

National and city-wide policies may be influencing development on a
city scale. Gentrification indicators provide a clue as to which tracts
are experiencing change, or very rapid change, if at all. But we haven’t
yet honed in on neighborhoods to guide our feature engineering. We
looked at permits by neighborhood for some clues and could immediately
see that there were high permits issued for neighborhoods that have been
the subject of locla articles about gentrification, including Fishtown,
Point
Breeze, and Lower
Kensington
res_permits3 <- res_permits %>%
group_by(permitdescription, typeofwork) %>%
mutate(n = n()) %>%
ungroup() %>%
select(permittype, permitdescription, typeofwork, geometry, permitissuedate) %>%
st_join(phl.nh) %>%
mutate(permit = paste(permitdescription, "-", typeofwork),
year = year(as.Date(permitissuedate))) %>%
filter(year>=2018 & permit %in% keep)
res_permits3 <- res_permits3 %>%
st_drop_geometry() %>%
group_by(mapname, year) %>%
summarise(n = n()) %>%
arrange(desc(n))
res_permits3 %>%
datatable()
When we look at this over time, we can see clearly that the
neighborhood patterns remain the same, with the same neighborhoods
receiving permits across a 6-year period.
res_permits3 <- res_permits3 %>%
left_join(phl.nh)
res_permits3 %>%
ungroup() %>%
st_as_sf() %>%
ggplot() +
geom_sf(aes(fill = cut(n, breaks = 7))) +
scale_fill_manual(values = palette1) +
facet_wrap(~year) +
mapTheme() +
labs(title = "Permits by Year, Philadelphia", caption = "Figure 6")

Now that we have explored gentrification indicators across
Philadelphia neighborhoods and tracts, permits issued over time and
space, and some basic demographic characteristics, we developed a
composite map where we can see which neighborhoods have gentrification
indicators and are seeing a lot of new development for a given year.
Click through the years to look at how gentrification indicators and
permits change over time.
In 2021, there are area that are gentrifying but not seeing a lot of
new development - particularly in the Northeast and Northwest.
Predicting development is therefore going to be slightly different than
predicting changes in rents, home values, and other demographic
characteristics like racial makeup.
bins = c(0,5,30,50,100,200,300,Inf)
permits18 <- colorBin(
palette = "viridis",
bins = bins,
domain = census_permits %>% filter(year == 2018) %>% .$TOTAL)
permits19 <- colorBin(
palette = "viridis",
bins = bins,
domain = census_permits %>% filter(year == 2019) %>% .$TOTAL)
permits20 <- colorBin(
palette = "viridis",
bins = bins,
domain = census_permits %>% filter(year == 2020) %>% .$TOTAL)
permits21 <- colorBin(
palette = "viridis",
bins = bins,
domain = census_permits %>% filter(year == 2021) %>% .$TOTAL)
permits22 <- colorBin(
palette = "viridis",
bins = bins,
domain = census_permits %>% filter(year == 2022) %>% .$TOTAL)
permits23 <- colorBin(
palette = "viridis",
bins = bins,
domain = census_permits %>% filter(year == 2023) %>% .$TOTAL)
indicator_pal <- colorFactor(palette = c("#c191d9", "#6d5480","#ff66f0", "#fccd32"),
domain = census_permits$indicator)
permits_map <- leaflet() %>%
addProviderTiles(providers$CartoDB.Positron) %>%
addPolygons(data = census_permits %>% filter(year == 2023),
stroke = TRUE,
color = ~indicator_pal(indicator),
weight = 2,
opacity = 1,
fillOpacity = .5,
fillColor = ~permits23(TOTAL),
label = paste("Total Permits:", census_permits %>% filter(year == 2023) %>% .$TOTAL),
group = "2023") %>%
addPolygons(data = census_permits %>% filter(year == 2022),
stroke = TRUE,
color = ~indicator_pal(indicator),
weight = 2,
opacity = 1,
fillOpacity = .5,
fillColor = ~permits22(TOTAL),
label = paste("Total Permits:", census_permits %>% filter(year == 2022) %>% .$TOTAL),
group = "2022") %>%
addPolygons(data = census_permits %>% filter(year == 2021),
stroke = TRUE,
color = ~indicator_pal(indicator),
weight = 2,
opacity = 1,
fillOpacity = .5,
fillColor = ~permits21(TOTAL),
label = paste("Total Permits:", census_permits %>% filter(year == 2021) %>% .$TOTAL),
group = "2021") %>%
addPolygons(data = census_permits %>% filter(year == 2020),
stroke = TRUE,
color = ~indicator_pal(indicator),
weight = 2,
opacity = 1,
fillOpacity = .5,
fillColor = ~permits20(TOTAL),
label = paste("Total Permits:", census_permits %>% filter(year == 2020) %>% .$TOTAL),
group = "2020") %>%
addPolygons(data = census_permits %>% filter(year == 2019),
stroke = TRUE,
color = ~indicator_pal(indicator),
weight = 2,
opacity = 1,
fillOpacity = .5,
fillColor = ~permits19(TOTAL),
label = paste("Total Permits:", census_permits %>% filter(year == 2019) %>% .$TOTAL),
group = "2019") %>%
addPolygons(data = census_permits %>% filter(year == 2018),
stroke = TRUE,
color = ~indicator_pal(indicator),
weight = 2,
opacity = 1,
fillOpacity = .5,
fillColor = ~permits18(TOTAL),
group = "2018",
label = paste("Total Permits:", census_permits %>% filter(year == 2018) %>% .$TOTAL)) %>%
addLegend(data = census_permits,
pal = indicator_pal,
value = ~indicator,
title = "Gentrification Indicator",
position = "bottomleft") %>%
addLayersControl(
baseGroups = c("2023", "2022", "2021", "2020", "2019", "2018"),
options = layersControlOptions(collapsed = FALSE))
permits_map %>%
setView("-75.139446","39.996849",zoom = 11)
When we just look at our gentrification indicator alone (without
permits), we can see that not all development is a result of
gentrification, but there are still some areas where the two overlap.
Although this project aims to provide targeted areas for Housing Trust
Funds, we have to recognize that permits alone do not perfectly align
with gentrification, but merely reflect new development.
census_permits %>%
filter(!is.na(year)) %>%
st_sf() %>%
ggplot()+
geom_sf(aes(fill = indicator)) +
scale_fill_manual(values = palette1) +
mapTheme() +
labs(title = "Gentrification Indicators", caption = "Figure 7")

FEATURE ENGINEERING
Spatial and spatial time lags, Census data
Developing spatial and spatial time lags can help us capture the
spatial endogeneity of gentrification as discussed here.
We created a spatial weights matrix based on contiguity relationships of
the census tracts sharing edges and vertices. Specifically, we looked at
the nearest neighbor for prices, rents, and demographics of neighboring
census tracts as possible key variables to predict for permit
development.
We also created a lag for permits issued to last year’s neighbors.
Spatial lag and spatial time lag has major predictive potential for our
modeling efforts. The average permit count for the nearest census tract
can be helpful, but utilizing the nearest neighbor for social
demographic data has the potential to be more influence. As high-demand
neighborhoods become more price unattainable, residents priced out try
to remain close and choose neighborhoods with some adjacent
characteristics such as bachelors degerees, income, or white residents,
and therefore more likelihood of higher social integration.
nb <- st_contiguity(census_permits$geometry, queen = TRUE)
wt <- st_weights(nb)
census_permits <- census_permits %>%
mutate(nn_mean_rent_22 = st_lag(Med_Rent_2022,nb,wt, na_ok = TRUE),
nn_mean_price_22 = st_lag(Med_Value_2022, nb,wt, na_ok = TRUE),
nn_mean_rent_17 = st_lag(Med_Rent_2017,nb,wt, na_ok = TRUE),
nn_mean_price_17 = st_lag(Med_Value_2017, nb,wt, na_ok = TRUE),
nn_bach_22 = st_lag(Percent_bach_2022, nb,wt, na_ok = TRUE),
nn_bach_17 = st_lag(Percent_bach_2017, nb,wt, na_ok = TRUE),
nn_income_22 = st_lag(Med_Inc_2022, nb,wt, na_ok = TRUE),
nn_income_17 = st_lag(Med_Inc_2017, nb,wt, na_ok = TRUE),
nn_pct_white_22 = st_lag(Percent_White_2022, nb,wt, na_ok = TRUE),
nn_pct_white_17 = st_lag(Percent_White_2017, nb,wt, na_ok = TRUE)
)
census_permits <- census_permits %>%
group_by(GEOID) %>%
arrange(GEOID, year) %>%
mutate(last_year_TOTAL = dplyr::lag(TOTAL),
last_year_TOTAL = ifelse(is.na(last_year_TOTAL), TOTAL, last_year_TOTAL)) %>%
ungroup() %>%
mutate(nn_LY_permits = st_lag(last_year_TOTAL, nb, wt))
Local Moran’s I
Moran’s I can help identify spatial autocorrelation, and similarity
among clusters of values. Moran’s I values throughout Philadelphia are
largely above 1, indicating statistically significant clusters wherein
nearest neighbors are having an impact on permit count. Below, we can
see that that most of Philadelphia has very low Moran’s I values except
for Fishtown and University City, which are outliers. Similarly, there
are higher probability values in the areas surrounding Fishtown and
University City, particularly in West Philly.
Higher probability for the nearest neighborhoods surrounding the high
demand neighborhoods suggest that there is some uncertainty regarding
prediction for these tracts. Although this generally indicates that our
confidence for predicting these tracts is lower, the clustering of these
probability values suggests that the nearest neighbor and spatial lag is
indicative of some factor, even if we are not identifying it with
exactness.
nb <- poly2nb(census_permits$geometry, queen=TRUE)
wt <- nb2listw(nb, style="W", zero.policy=TRUE)
local_morans <- localmoran(census_permits$TOTAL, wt, zero.policy=TRUE) %>%
as.data.frame() %>%
rename(c("Local_Morans_I" = "Ii", "P_Value" = "Pr(z != E(Ii))"))
census_permits <-
cbind(census_permits, local_morans %>% select(Local_Morans_I, P_Value)) %>%
st_sf() %>%
mutate(Significant_Hotspots = ifelse(P_Value <= 0.001, 1, 0))
#Plot Morans I
localMorans <-
cbind(local_morans, census_permits) %>%
st_sf() %>%
dplyr::select(TOTAL,
Local_Morans_I,
P_Value) %>%
mutate(Significant_Hotspots = ifelse(P_Value <= 0.001, 1, 0)) %>%
gather(Variable, Value, -geometry)
## This is just for plotting
vars <- unique(localMorans$Variable)
varList <- list()
for(i in vars){
varList[[i]] <-
ggplot() +
geom_sf(data = filter(localMorans, Variable == i),
aes(fill = Value), colour=NA) +
scale_fill_viridis_b(name="") +
labs(title=i) +
mapTheme(title_size = 14) + theme(legend.position="bottom")
}
do.call(grid.arrange,c(varList, ncol = 4, top = "Local Morans I statistics, Total Permits"))

Additional Feature Engineering
Coffee shops
Finally, we generated a few amenity features such as stores or
infrastructure that may make a neighborhood more attractive. These
include cafes from 2018 to 2023 since the presence of new cafes may be
correlated with gentrification or new development.
#coffee shops by year, then join to census_permits.
# cafes18 <- getbb('Philadelphia, PA') %>%
# opq(datetime = '2018-01-01T00:00:01Z') %>%
# add_osm_feature(key = "amenity", value = "cafe") %>%
# osmdata_sf()
#
# cafes18.sf <- st_as_sf(cafes18$osm_points)
#
# cafes18.sf <- cafes18.sf %>%
# st_transform(crs = st_crs(census_permits)) %>%
# st_join(census_permits, join = st_within)
#
# cafes18.sf <- cafes18.sf %>%
# filter(!is.na(GEOID)) %>%
# group_by(GEOID) %>%
# summarise(cafe_count = n()) %>%
# mutate(cafeyear = 2018)
#
# cafes19 <- getbb('Philadelphia, PA') %>%
# opq(datetime = '2019-01-01T00:00:01Z') %>%
# add_osm_feature(key = "amenity", value = "cafe") %>%
# osmdata_sf()
#
# cafes19.sf <- st_as_sf(cafes19$osm_points) %>%
# st_transform(crs = st_crs(census_permits)) %>%
# st_join(census_permits, join = st_within) %>%
# filter(!is.na(GEOID)) %>%
# group_by(GEOID) %>%
# summarise(cafe_count = n()) %>%
# mutate(cafeyear = 2019)
#
# cafes20 <- getbb('Philadelphia, PA') %>%
# opq(datetime = '2020-01-01T00:00:01Z') %>%
# add_osm_feature(key = "amenity", value = "cafe") %>%
# osmdata_sf()
#
# cafes20.sf <- st_as_sf(cafes20$osm_points) %>%
# st_transform(crs = st_crs(census_permits)) %>%
# st_join(census_permits, join = st_within) %>%
# filter(!is.na(GEOID)) %>%
# group_by(GEOID) %>%
# summarise(cafe_count = n()) %>%
# mutate(cafeyear = 2020)
#
# cafes21 <- getbb('Philadelphia, PA') %>%
# opq(datetime = '2021-01-01T00:00:01Z') %>%
# add_osm_feature(key = "amenity", value = "cafe") %>%
# osmdata_sf()
#
# cafes21.sf <- st_as_sf(cafes21$osm_points) %>%
# st_transform(crs = st_crs(census_permits)) %>%
# st_join(census_permits, join = st_within) %>%
# filter(!is.na(GEOID)) %>%
# group_by(GEOID) %>%
# summarise(cafe_count = n()) %>%
# mutate(cafeyear = 2021)
#
# cafes22 <- getbb('Philadelphia, PA') %>%
# opq(datetime = '2022-01-01T00:00:01Z') %>%
# add_osm_feature(key = "amenity", value = "cafe") %>%
# osmdata_sf()
#
# cafes22.sf <- st_as_sf(cafes22$osm_points) %>%
# st_transform(crs = st_crs(census_permits)) %>%
# st_join(census_permits, join = st_within) %>%
# filter(!is.na(GEOID)) %>%
# group_by(GEOID) %>%
# summarise(cafe_count = n()) %>%
# mutate(cafeyear = 2022)
#
# cafes23 <- getbb('Philadelphia, PA') %>%
# opq(datetime = '2023-01-01T00:00:01Z') %>%
# add_osm_feature(key = "amenity", value = "cafe") %>%
# osmdata_sf()
#
# cafes23.sf <- st_as_sf(cafes23$osm_points) %>%
# st_transform(crs = st_crs(census_permits)) %>%
# st_join(census_permits, join = st_within) %>%
# filter(!is.na(GEOID)) %>%
# group_by(GEOID) %>%
# summarise(cafe_count = n()) %>%
# mutate(cafeyear = 2023)
#
# cafes_list <- list(cafes18.sf, cafes19.sf, cafes20.sf, cafes21.sf, cafes22.sf,
# cafes23.sf)
#
# all_cafes <- reduce(cafes_list, bind_rows) %>%
# rename("year" = "cafeyear") %>%
# st_drop_geometry()
#
#
# write.csv(all_cafes, "Data/all_cafes.csv")
#
# rm(cafe18, cafes18.sf, cafes19, cafes19.sf, cafes20, cafes20.sf, cafes21, cafes21.sf, cafes22, cafes22.sf, cafes23, cafes23.sf)
all_cafes <- read.csv("Data/all_cafes.csv") %>%
select(-X)
all_cafes$GEOID <- as.character(all_cafes$GEOID)
census_permits <- census_permits %>%
left_join(all_cafes, by = c("GEOID", "year")) %>%
mutate(cafe_count = ifelse(is.na(cafe_count), 0, cafe_count))
Distance to transit stations
We also looked at proximity to fixed public transportation within the
city of Philadelphia, including regional rail, the Broad Street Line,
Market Frankford Line, and trolley lines.
regionalrail <- st_read("Data/Regional_Rail_Stations.geojson")
trolleys <- st_read("Data/Trolley_Stations.geojson") %>%
rename(Line_Name = LineAbbr,
Latitude = Lat,
Longitude = Lon,
Station_Na = StopName) %>%
select(FID, Line_Name, Station_Na, Latitude, Longitude, geometry)
metro <- st_read("Data/Highspeed_Stations.geojson") %>%
rename(Line_Name = Route,
Station_Na = Station)
transit <- rbind(regionalrail, trolleys, metro) %>%
st_transform(crs = st_crs(phl_tracts22))
phl_transit <- st_join(transit, phl_tracts22, join = st_within) %>%
filter(!is.na(GEOID)) %>%
select(Line_Name, Station_Na, Latitude, Longitude, GEOID, geometry) %>%
st_transform(crs = st_crs(census_permits))
census_permits <- census_permits %>%
mutate(nearest_transit = nn_function(st_coordinates(st_centroid(census_permits)), st_coordinates(phl_transit), k=1)*69.2)
Number of jobs within tract
Finally, we looked at job counts per tract. Although we initially
were going to look at places such as major universities or distance to
center city, ultimately, we recognized that these locations are
substitutes for job density. We looked at jobs over time, but overall
the proportion remains pretty similar across the years for which we have
data. Therefore, we interpolate
the geographies to 2020 tracts and then look at spatial lag – jobs in
surrounding census tracts. This feature could use some further
development, because our visual assessment is that people do not want to
live in the areas with the highest job density, but a within a certain
distance of those job centers.
#use Lehdr package to bring in jobs and to analyze where people are going by work and home census tract.
#derived from: https://github.com/jamgreen/lehdr
#pa jobs from LEHDR
#match with lodes data from previous year, grab data for 2019 - 2022
#match the individual permit sheet with the previous years job data
#
# pa_jobs19 <- grab_lodes(state = "pa",
# year = 2019,
# version = "LODES8",
# lodes_type = "od",
# job_type = "JT01",
# segment = "S000",
# state_part = "main",
# agg_geo = "tract") %>%
# select(-h_tract)
#
# pa_jobs20 <- grab_lodes(state = "pa",
# year = 2020,
# version = "LODES8",
# lodes_type = "od",
# job_type = "JT01",
# segment = "S000",
# state_part = "main",
# agg_geo = "tract")%>%
# select(-h_tract)
#
# pa_jobs21 <- grab_lodes(state = "pa",
# year = 2021,
# version = "LODES8",
# lodes_type = "od",
# job_type = "JT01",
# segment = "S000",
# state_part = "main",
# agg_geo = "tract")%>%
# select(-h_tract)
#
# #filter for only jobs that start or end in Philadelphia tracts
#
# phl_jobs19 <- pa_jobs19 %>%
# left_join(phl_tracts22, by = c("w_tract" = "GEOID")) %>%
# filter(!st_is_empty(geometry)) %>%
# st_as_sf() %>%
# rename(GEOID = w_tract) %>%
# group_by(GEOID) %>%
# summarise(job_count = n())
#
# phl_jobs19 <- interpolate_pw(
# phl_jobs19,
# phl_tracts22,
# to_id = "GEOID",
# extensive = TRUE,
# weights = phl_tracts22,
# weight_column = "Total_Pop",
# crs = 4269)
#
# phl_jobs20 <- pa_jobs20 %>%
# left_join(phl_tracts22, by = c("w_tract" = "GEOID")) %>%
# filter(!st_is_empty(geometry)) %>%
# st_as_sf() %>%
# rename(GEOID = w_tract) %>%
# group_by(GEOID) %>%
# summarise(job_count = n())
#
# phl_jobs21 <- pa_jobs21 %>%
# left_join(phl_tracts22, by = c("w_tract" = "GEOID")) %>%
# filter(!st_is_empty(geometry)) %>%
# st_as_sf() %>%
# rename(GEOID = w_tract) %>%
# group_by(GEOID) %>%
# summarise(job_count = n())
#
# phl_jobs_combined <- rbind(cbind(phl_jobs19, year = 2019),
# cbind(phl_jobs20, year = 2020),
# cbind(phl_jobs21, year = 2021))
#
# ggplot(phl_jobs_combined) +
# geom_sf(aes(fill = job_count)) +
# labs(title = "Job Distribution in Philadelphia", color = "Job Count") +
# mapTheme() +
# facet_wrap(~year, ncol = 3)
#
# phl_jobs_combined$year <- as.Numeric(phl_jobs_combined$year)
phl_jobs_combined <- read.csv("Data/phl_jobs.csv") %>%
select(-X) %>%
mutate(GEOID = as.character(GEOID)) %>%
rename("year_jobs" = "year")
census_permits <- census_permits %>%
mutate(year_jobs = case_when(year == 2018 ~ 2019,
year == 2022 ~ 2021,
year == 2023 ~ 2021,
year %in% c(2019,2020,2021) ~ year)) %>%
left_join(phl_jobs_combined,
by = c("GEOID","year_jobs")) %>%
select(-year_jobs)
phl_jobs_combined <- phl_jobs_combined %>%
left_join(phl_tracts22, by = c("GEOID" = "GEOID")) %>%
select(job_count, GEOID, geometry) %>%
st_as_sf()
nb <- st_contiguity(census_permits$geometry, queen = TRUE)
wt <- st_weights(nb)
census_permits <- census_permits %>%
mutate(jobs.nn= st_lag(job_count,nb,wt, na_ok = TRUE))
Neighborhoods
census_permits <- st_join(st_centroid(census_permits), phl.nh, join = st_within) %>%
st_drop_geometry() %>%
left_join(phl_tracts4326) %>%
st_as_sf()
EXPLORATORY ANALYSIS
Here, we look at our dependent variable, total permits, as a function
of our numeric variables. Last year’s permits and the Local Moran’s I
are most strongly correlated.
Some of these graphs show that permits spike for values in the middle
of the distribution. As an experiment, we could make a binary variable
for relationships where there’s a spike in the middle - between the 45th
and 55th percentile or not.
#filter the outliers
census_permits_test <- census_permits %>%
filter(TOTAL < 600) %>%
dplyr::select(-indicator, -mapname, -name, -Significant_Hotspots) %>%
gather(variable, value, -geometry, -year, -GEOID, -TOTAL)
ggplot(census_permits_test, aes(value, TOTAL)) +
geom_point() +
stat_smooth(aes(value, TOTAL),
method = "lm", se = FALSE, size = 1, colour="#ffb600") +
facet_wrap(~variable, scales= "free", ncol=4) +
labs(title = "Exploratory Analysis", caption = "Figure 8")

We can test this theory out on the variables that appear to have the
largest spikes in the middle. It turns out this is only true for three
variables – the middle percentile (between 45th and 55th percentile) is
related to the number of permits for the change in percentage of homes
with a second mortgage between 2017 and 2022 5-Year ACS surveys, for
2017 adjusted mean rent, and for percent owner in 2022.
census_permits_test <- census_permits %>%
mutate(
middle_bach_17 = ifelse(Bach_2017 > quantile(Bach_2017, .45, na.rm = TRUE) & Bach_2017 <= quantile(Bach_2017, .55, na.rm = TRUE), TRUE, FALSE),
#maybe this one
middle_rent_17 = ifelse(Med_Rent_2017 > quantile(Med_Rent_2017, .45, na.rm = TRUE) & Med_Rent_2017 <= quantile(Med_Rent_2017, .55, na.rm = TRUE), TRUE, FALSE),
middle_price_NET = ifelse(Med_Value_NET > quantile(Med_Value_NET, .45, na.rm = TRUE) & Med_Value_NET <= quantile(Med_Value_NET, .55, na.rm = TRUE), TRUE, FALSE),
middle_nn_bach_22 = ifelse(nn_bach_22 > quantile(nn_bach_22, .45, na.rm = TRUE) & nn_bach_22 <= quantile(nn_bach_22, .55, na.rm = TRUE), TRUE, FALSE),
middle_nn_income_22 = ifelse(nn_income_22 > quantile(nn_income_22, .45, na.rm = TRUE) & nn_income_22 <= quantile(nn_income_22, .55, na.rm = TRUE), TRUE, FALSE),
middle_nn_mean_price_22 = ifelse(nn_mean_price_22 > quantile(nn_mean_price_22, .45, na.rm = TRUE) & nn_mean_price_22 <= quantile(nn_mean_price_22, .55, na.rm = TRUE), TRUE, FALSE),
middle_nn_mean_rent_22 = ifelse(nn_mean_rent_22 > quantile(nn_mean_rent_22, .45, na.rm = TRUE) & nn_mean_rent_22 <= quantile(nn_mean_rent_22, .55, na.rm = TRUE), TRUE, FALSE),
middle_Owners_PCT = ifelse(Owners_PCT > quantile(Owners_PCT, .45, na.rm = TRUE) & Owners_PCT <= quantile(Owners_PCT, .55, na.rm = TRUE), TRUE, FALSE),
#this one
middle_Percent_2mort_NET = ifelse(Percent_2mort_NET > quantile(Percent_2mort_NET, .45, na.rm = TRUE) & Percent_2mort_NET <= quantile(Percent_2mort_NET, .55, na.rm = TRUE), TRUE, FALSE),
middle_Percent_bach_2022 = ifelse(Percent_bach_2022 > quantile(Percent_bach_2022, .45, na.rm = TRUE) & Percent_bach_2022 <= quantile(Percent_bach_2022, .55, na.rm = TRUE), TRUE, FALSE),
middle_Percent_bach_PCT = ifelse(Percent_bach_PCT > quantile(Percent_bach_PCT, .45, na.rm = TRUE) & Percent_bach_PCT <= quantile(Percent_bach_PCT, .55, na.rm = TRUE), TRUE, FALSE),
#this one
middle_Percent_Owner_2022 = ifelse(Percent_Owner_2022 > quantile(Percent_Owner_2022, .45, na.rm = TRUE) & Percent_Owner_2022 <= quantile(Percent_Owner_2022, .55, na.rm = TRUE), TRUE, FALSE),
middle_Rent_Income_Ratio_2017 = ifelse(Rent_Income_Ratio_2017 > quantile(Rent_Income_Ratio_2017, .45, na.rm = TRUE) & Rent_Income_Ratio_2017 <= quantile(Rent_Income_Ratio_2017, .55, na.rm = TRUE), TRUE, FALSE),
#Also testing this new binary variable!
transit_under_1.5 = ifelse(nearest_transit <= 1.5, TRUE, FALSE)
) %>%
dplyr::select(TOTAL, starts_with('middle'), transit_under_1.5) %>%
st_drop_geometry() %>%
gather(variable, value, -TOTAL)
census_permits_test %>%
filter(TOTAL < 600) %>%
group_by(variable, value) %>%
summarise(mean_permits = mean(TOTAL)) %>%
ungroup() %>%
ggplot(aes(value,mean_permits)) +
geom_bar(position = "dodge", stat= "identity", fill = "#ffb600") +
facet_wrap(~variable, scales = "free") +
labs(x="Total Permits", y="Value",
title = "Middle percentile (45-55th) associations with number of permits", caption = "Figure 9") +
theme(legend.position = "none")

# Mutating only for the variables variables that mattered
census_permits <- census_permits %>%
mutate(middle_Percent_Owner_2022 = ifelse(Percent_Owner_2022 > quantile(Percent_Owner_2022, .45, na.rm = TRUE) & Percent_Owner_2022 <= quantile(Percent_Owner_2022, .55, na.rm = TRUE), TRUE, FALSE),
middle_Percent_2mort_NET = ifelse(Percent_2mort_NET > quantile(Percent_2mort_NET, .45, na.rm = TRUE) & Percent_2mort_NET <= quantile(Percent_2mort_NET, .55, na.rm = TRUE), TRUE, FALSE),
middle_rent_17 = ifelse(Med_Rent_2017 > quantile(Med_Rent_2017, .45, na.rm = TRUE) & Med_Rent_2017 <= quantile(Med_Rent_2017, .55, na.rm = TRUE), TRUE, FALSE),
transit_under_1.5 = ifelse(nearest_transit <= 1.5, TRUE, FALSE))
#Mapping cafes and vacancy:
# ggplot(census_permits) +
# geom_sf(aes(fill = cafe_count)) +
# labs(title = "Number of Cafes per Tract") +
# facet_wrap(~year) +
# mapTheme()
#
# ggplot(census_permits %>% filter(year == 2023)) +
# geom_sf(aes(fill = Percent_Vacant_2016)) +
# labs(title = "Vacancy Rate 2016") +
# mapTheme()
Correlation Matrix
We compared variables against each other in a correlation test. Many
of the variables correlated with each other more than with the count of
total permits. The most correlated variables with permit count are
vacancy with a positive correlation, wherein more vacancy means higher
permit issuance. After some additional analysis and feature engineering
however, we realize that close proximity to transit (within 1.5 mile) is
more significant than the nearest neighbor transit. Variables are of
course correlated with each other, but overall census variables are not
as correlated with permit count as expected.
numericVars <- census_permits %>%
st_drop_geometry() %>%
select(TOTAL, cafe_count, nearest_transit, job_count, Med_Inc_2022, Rent_Income_Ratio_NET,
White_Pop_2022, Med_Value_NET, Vacants_2022, Vacants_2017,
Vacants_PCT, Vacants_NET)
price.corr <- corr.test(numericVars)
if (sum(is.na(numericVars)) > 0) {
# Handle missing values (e.g., impute or remove)
numericVars <- na.omit(numericVars) # Remove rows with missing values
}
correlation_matrix <- round(cor(numericVars), 1)
p_values <- cor_pmat(numericVars)
ggcorrplot(
correlation_matrix,
p.mat = p_values,
colors = c( "#f20089", "white", "#2d00f7"),
type = "lower",
insig = "blank"
) +
labs(title = "Degrees of Correlation to Permits Issued", caption = "Figure 10") +
annotate(
geom = "rect",
xmin = .5, xmax = 11.5, ymin = .5, ymax = 1.5,
fill = "transparent", color = "#ffb600", alpha = 0.5
)

MODELING
For the baseline, we included census variables including median HH
income (CPI inflation adjusted), Median home value, Median rent, Median
year structure built, tenure, number of housing units, number of
vacancies, count of second mortgages, percent with a Bachelor’s degree,
and percent white population.
After doing our exploratory analysis, we narrowed down the variables
for our model. For the spatial and temporal lag model, we used these
variables, grouped into the following categories:
TIME/SPACE
- name (neighborhood)
- year
GENTRIFICATION INDICATORS
- indicator
(incorporates bachelors degrees, income price changes, rent changes, and
population)
DEMOGRAPHIC (the strongest)
- Bach_NET
-
middle_Percent_Owner_2022
- middle_Percent_2mort_NET
(other demographic variables) - Med_Value_NET
-
Percent_2mort_2022
- Percent_bach_NET
- Rent_Income_Ratio_2022
- White_Pop_NET
AMENITIES
- cafe_count
- job_count
-
transit_under_1.5
TIME LAG
- last_year_TOTAL
-
Med_Value_2017
- Total_Units_2017
- Vacants_2017
-
middle_rent_17
SPATIAL LAG
- Local Moran’s I
-
nn_LY_permits
- nn_bach_22
- nn_income_17
Several iterations of models were developed over the course of this
project, each attempting to isolate and coalesce variables in different
combinations to create as accurate a prediction as possible.
We used three cross-validation methods: random 10-fold
cross-validation, spatial cross-validation, and temporal
cross-validation. By cross validating, we can detect generalizability
& accuracy for the model’s predictions across time and space in
Philadelphia. Cross validation will also help determine whether our
model is overfitting. With overtfitting, the model is very accurate but
not generalizable to new datasets as new data becomes available and
input into the model.
set.seed(152)
census_permits_dat <- census_permits %>%
filter(year %in% c(2020, 2021, 2022)) %>%
mutate(cvID = sample(round(nrow(.) / 115.5),
size=nrow(.), replace = TRUE)) %>%
filter(name != "BYBERRY" & name != "WISSAHICKON_PARK" & name != "GLENWOOD" & name != "WISTER"
& name != "MCGUIRE")
# BASELINE
reg.baseline.vars <- c("year","indicator","name","Percent_White_2022","Percent_Owner_2022", "Percent_Vacant_2022", "Med_Inc_2022","Med_Value_2022","Med_Rent_2022", "Med_Structure_2022",
"Tot_Units_2022", "Vacants_2022", "Sec_Mort_2022", "Percent_bach_2022",
"Rent_Income_Ratio_2022",
"Bach_NET", "middle_Percent_Owner_2022","middle_Percent_2mort_NET",
"Med_Value_NET","Percent_bach_NET", "White_Pop_NET",
"cafe_count", "job_count", "transit_under_1.5")
reg.baseline.vars.space <- reg.baseline.vars[reg.baseline.vars != "name"]
reg.baseline.vars.time <- reg.baseline.vars[reg.baseline.vars != "year"]
# SPACE AND TIME - zero inflated
reg.st.vars <- c( "year", "indicator", "name","Bach_NET","middle_Percent_Owner_2022","middle_Percent_2mort_NET",
"Med_Value_NET", "Percent_2mort_2022", "Percent_bach_NET",
"Rent_Income_Ratio_2022", "White_Pop_NET", "cafe_count",
"transit_under_1.5", "last_year_TOTAL",
"Med_Value_2017", "Tot_Units_2017", "Vacants_2017",
"middle_rent_17", "Local_Morans_I", "nn_LY_permits",
"nn_bach_22", "nn_income_17", "jobs.nn")
reg.st.space <-reg.st.vars[reg.st.vars != "name"]
reg.st.time <- reg.st.vars[reg.st.vars != "year"]
# Baseline
reg.baseline.random <- crossValidate(
dataset = census_permits_dat,
id = "cvID",
dependentVariable = "TOTAL",
indVariables = reg.baseline.vars) %>%
dplyr::select(cvID = cvID, TOTAL, Prediction, name)
reg.baseline.cv.time <- crossValidate(
dataset = census_permits_dat,
id = "year",
dependentVariable = "TOTAL",
indVariables = reg.baseline.vars.time) %>%
dplyr::select(cvID = year, TOTAL, Prediction, name)
reg.baseline.cv.space <- crossValidate(
dataset = census_permits_dat,
id = "name",
dependentVariable = "TOTAL",
indVariables = reg.baseline.vars.space) %>%
dplyr::select(cvID = name, TOTAL, Prediction) %>%
mutate(name = cvID) %>%
dplyr::select(cvID, TOTAL, Prediction, name, geometry)
# space/time lag
reg.st.cv.random <- crossValidate(
dataset = census_permits_dat,
id = "cvID",
dependentVariable = "TOTAL",
indVariables = reg.st.vars) %>%
dplyr::select(cvID = cvID, TOTAL, Prediction, name)
reg.st.cv.time <- crossValidate(
dataset = census_permits_dat,
id = "year",
dependentVariable = "TOTAL",
indVariables = reg.st.time) %>%
dplyr::select(cvID = year, TOTAL, Prediction, name)
reg.st.cv.space <- crossValidate(
dataset = census_permits_dat,
id = "name",
dependentVariable = "TOTAL",
indVariables = reg.st.space) %>%
dplyr::select(cvID = name, TOTAL, Prediction) %>%
mutate(name = cvID) %>%
dplyr::select(cvID, TOTAL, Prediction, name, geometry)
reg.summary <-
rbind(
mutate(reg.baseline.random, Error = Prediction - TOTAL,
Regression = "Random k-fold CV: Baseline Variables"),
mutate(reg.st.cv.random, Error = Prediction - TOTAL,
Regression = "Random k-fold CV: Spatial and Temporal Lag"),
mutate(reg.baseline.cv.space, Error = Prediction - TOTAL,
Regression = "Spatial LOGO-CV: Baseline Variables"),
mutate(reg.st.cv.space, Error = Prediction - TOTAL,
Regression = "Spatial LOGO-CV: Spatial and Temporal Lag"),
mutate(reg.baseline.cv.time, Error = Prediction - TOTAL,
Regression = "Temporal LOGO-CV: Baseline Variables"),
mutate(reg.st.cv.time, Error = Prediction - TOTAL,
Regression = "Temporal LOGO-CV: Spatial and Temporal Lag")
) %>%
st_sf()
Explore Errors
Mean absolute error is quite low and similar across all cross
validation methods with only a few outliers. These outliers have
significant count differences, but the rest of the neighborhoods are
overal the same. The spatial and temporal lag model when cross validated
at first appears to be the strongest model with the lowest error, which
suggests that time and space lag factors are the most influential
predictors.
error_by_reg <-
reg.summary %>%
group_by(Regression) %>%
summarize(Mean_Error = mean(Error, na.rm = T),
MAE = mean(abs(Error), na.rm = T),
SD_MAE = sd(abs(Error), na.rm = T))
error_by_reg_nh <-
reg.summary %>%
group_by(Regression, name) %>%
summarize(Mean_Error = mean(Prediction - TOTAL, na.rm = T),
MAE = mean(abs(Prediction - TOTAL), na.rm = T),
SD_MAE = sd(abs(Prediction - TOTAL), na.rm = T)) %>%
ungroup()
## plot histogram of errors
error_by_reg_nh %>%
ggplot(aes(MAE)) +
geom_histogram(bins = 30, colour="black", fill = "#f20089") +
facet_wrap(~Regression, ncol = 2) +
labs(title="Distribution of MAE by Regression and Neighborhood",
x="Mean Absolute Error", y="Count")

Spatial Distribution of Errors
When we examine the temporal and spatial lag in further detail, it
becomes apparent that the neighborhoods with highest outlier in
predicted count are near the Fishtown and Southwest Philly
neighborhoods. Other than these, error is quite small throughout the
rest of Philadelphia.
reg.summary %>%
filter(Regression == "Temporal LOGO-CV: Spatial and Temporal Lag") %>%
group_by(name) %>%
st_drop_geometry() %>%
summarize(residuals = mean(Error, na.rm = T)) %>%
ungroup() %>%
left_join(phl.nh, by = c("name" = "name")) %>%
st_sf() %>%
ggplot() +
geom_sf(aes(fill = residuals)) +
scale_fill_gradient2(low = "#8900f2" , mid = "#ffb600", high = "#e500a4", name = "Error") +
labs(title = "Spatial and Temporal Lag: \nMean Absolute Error by Neighborhood", caption = "Figure 10") +
mapTheme()

Finally, when plotting predicted permits compared with observed
permits, we can see that the observed and predicted permits are fairly
close but not so much that the model is overfitting. It is therefore
generalizable and overall accurate.
reg.summary %>%
filter(Regression == "Temporal LOGO-CV: Spatial and Temporal Lag") %>%
ggplot(aes(TOTAL, Prediction)) +
geom_point() +
stat_smooth(aes(TOTAL, TOTAL),
method = "lm", se = FALSE, size = 1, colour="#ffb600") +
stat_smooth(aes(Prediction, TOTAL),
method = "lm", se = FALSE, size = 1, colour="#f20089") +
labs(title="Predicted permits as a function of observed permits",
subtitle="Orange line represents a perfect prediction; Pink line represents prediction", caption = "Figure 21") +
xlab("Actual Permits") +
ylab("Predicted Permits") +
plotTheme()

Our spatial regression model cross validates on a neighborhood level
across time, while our temporal regression cross validates year by year.
Regression error for the temporal logo cross validation with spatial and
temporal lag have the lowest mean absolute error as well as standard
deviation, which remains consistent with our other findings of that
model. It is clear that this model is generalizable, accurate, and
overall a good fit across measurements. Because cross validating on a
temporal scale appears to have the lowest mean MAE and Standard
Deviation, we can be fairly certain that time is key to predicting
permit issuance. This is also important since it means that additional
years of data will ideally be able to predict permit issuance with
future years of datasets.
st_drop_geometry(error_by_reg_nh) %>%
group_by(Regression) %>%
summarize(Mean_MAE = round(mean(MAE), 2),
SD_MAE = round(sd(MAE), 2)) %>%
kable() %>%
kable_styling("striped", full_width = F) %>%
row_spec(3, color = "black", background = "#8900f2") %>%
row_spec(4, color = "black", background = "#8900f2") %>%
row_spec(5, color = "black", background = "#ffb600") %>%
row_spec(6, color = "black", background = "#ffb600")
|
Regression
|
Mean_MAE
|
SD_MAE
|
|
Random k-fold CV: Baseline Variables
|
18.68
|
26.03
|
|
Random k-fold CV: Spatial and Temporal Lag
|
15.66
|
20.21
|
|
Spatial LOGO-CV: Baseline Variables
|
34.18
|
39.30
|
|
Spatial LOGO-CV: Spatial and Temporal Lag
|
25.90
|
45.36
|
|
Temporal LOGO-CV: Baseline Variables
|
17.26
|
22.59
|
|
Temporal LOGO-CV: Spatial and Temporal Lag
|
14.71
|
18.95
|
---
title: "Targeting Community Land Trust Interventions to Fight Displacement"
author: "Lizzie Shackney and Alexa Ringer"
date: "`r Sys.Date()`"
output: 
  html_document:
    theme: cosmo
    toc: true
    toc_float: true
    toc_depth: 3
    code_folding: "hide"
    code_download: true
editor_options:
  markdown:
    wrap: sentence
    
---

## **INTRODUCTION**

Where is development going to happen? Development is critical to accommodate growth and change, to provide housing, infrastructure, amenities, employment, and community spaces for economic prosperity and affordable shelter. But development has its downsides, too. When investments in neighborhoods drive up prices and the cost of living, it can – either directly or indirectly – force out long-term residents who are more vulnerable to cost changes. Understanding how economic development can equitably benefit residents rather than just push them to the urban periphery can help ensure that urban policy benefits city’s most vulnerable residents. Understanding patterns of and predicting development is key to this, and it's challenging.  

A key to understanding patterns of development is understanding [endogenous gentrification]( https://urbanspatialanalysis.com/portfolio/predicting-gentrification-using-longitudinal-census-data/). Endogenous gentrification describes the in-migration of wealthier residents to adjacent, poorer neighborhoods, causing prices to rise and lower-income residents to be priced out. We incorporate endogenous gentrification as well as other indicators of gentrification, neighborhood amenities and infrastructure to predict which areas of Philadelphia are most likely to experience new development. If tenants can understand how far into the future development pressure is likely to provide gentrification pressure and where, they can better target anti-displacement strategies. This application will help non-profit and volunteer-based housing justice organizations maximize tenant organizing capacity and resources for anti-displacement by directing policy makers where to invest housing trust funds & community land trusts.

### Use Case
Community land trusts are an [often-cited strategy for preventing displacement]( https://smartgrowthamerica.org/resources/strategies-to-minimize-displacement-community-land-trust/#:~:text=Over%20time%2C%20community%20land%20trusts,real%20estate%20values%20is%20reduced.). A community land trust is a long-term strategy to preserve affordable housing by putting the land of property in a community-controlled trust and only selling the structure. By separating the land and property from the structure (as well as imposing deed restrictions for affordability including a predetermined re-sale formulate) guarantees affordability over the long term. Affordability measure can help low-income residents build equity through homeownership and retain affordable housing for future residents. However, community land trusts [lack sufficient funding]( https://www.sightline.org/2021/08/23/how-can-cities-move-the-needle-on-community-land-trusts/) to scale up to meet affordable housing needs when demand is high across the board **With limited resources and widespread need, targeting land trust funding is critical to prevent displacement.**

![Source: Virginia Statewide Community Land Trust, https://www.vsclt.org/faq/](CLT_Model.png)

### Website Mockup

include photos of our hypothetical website application that could be used as a reuslt of this project. 

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE) 
options(scipen=999)

colorize <- function(x, color) {
  if (knitr::is_latex_output()) {
    sprintf("\\textcolor{%s}{%s}", color, x)
  } else if (knitr::is_html_output()) {
    sprintf("<span style='color: %s;'>%s</span>", color,
      x)
  } else x
}

library(MASS)
library(ggplot2)
library(tidyverse)
library(sf)
library(spdep)
library(caret)
library(ckanr)
library(ggcorrplot) # plot correlation plot
library(sfdep)
library(tidycensus)
library(httr)
library(lubridate)
library(rphl)
library(geojsonio)
library(leaflet)
library(lehdr)
library(kableExtra)
library(knitr)
library(DT)
library(tigris)
library(psych)
library(osmdata)
library(tigris)
library(leaflet)
library(FNN)
library(grid)
library(gridExtra)
library(pscl)
library(mpath)
library(DT)


source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")

#palette

palette1 <- c("#2d00f7", "#6a00f4", "#8900f2", "#bc00dd", "#e500a4", "#f20089", "#ffb600")

#Load neighborhood data

phl.nh <- st_read("https://raw.githubusercontent.com/azavea/geo-data/master/Neighborhoods_Philadelphia/Neighborhoods_Philadelphia.geojson")%>%
    st_transform(crs=4326) %>%
  dplyr::select(name, mapname)

phl.cd <- st_read("https://opendata.arcgis.com/datasets/9298c2f3fa3241fbb176ff1e84d33360_0.geojson")%>%
    st_transform(crs=4326) 

# Load permit data

# # URL

# carto_url = "https://phl.carto.com/api/v2/sql"
# 
# # permit data
# table_name = "permits"
# 
# # query
# where = "commercialorresidential = 'RESIDENTIAL'"
# 
# query = paste("SELECT *, ST_Y(the_geom) AS lat, ST_X(the_geom) AS lng",
#               "FROM", table_name,
#               "WHERE", where)
# 
# res_permits = rphl::get_carto(query, format = "csv", base_url = carto_url, stringsAsFactors = F)

res_permits <- read.csv("Data/res_permits_11.25.csv")

```

## **METHODOLOGY**

Using Philadelphia permit data as a case study, we created a Poisson model to predict where development is most likely to occur. We developed a spatial lag for Philadelphia neighborhoods to predict new construction permit issuance. One [study](https://researchrepository.wvu.edu/cgi/viewcontent.cgi?article=9005&context=etd) examined Washington D.C., weighing permit patterns as potential indicators of development. This study identified and assigned weights to the following permit types:
- Addition/Addition, Alteration, and Repair/Alteration and Repair
- Electric/Electric General/Electric Heavy Up
- Plumbing/Plumbing and Gas
- Raze
- New Building
- Demolition
- Mechanical
- Certificate of Occupancy

We examined the Philadelphia permit dataset and filtered for residential construction permits. We then  looked at population level data [indicators of gentrification](https://drexel.edu/uhc/resources/briefs/Measure-of-Gentrification-for-Use-in-Longitudinal-Public-Health-Studies-in-the-US/) such as **vacancy, residents with bachelor’s degrees, changes in rents, changes in home values, and median income** to assign an indicator to each Census tract. In areas with decreasing vacancy and increasing bachelor’s degrees & incomes, we have used these as gentrification indicators independent of development. We then examined which permit types are most strongly associated with gentrification or intense gentrification, particularly where there is a large difference in the mean number of permits in gentrifying areas as opposed to non-gentrifying or ineligible areas (where income and home values are already high).

We engineered other features including proximity to desirable amenities, such as public transit, cafes, and jobs. We create spatial and temporal lags for many of our demographic variables, looking at the values of neighboring census tracts and the values from the prior (2017) 5-year American Community Survey.

We then run six Poisson linear regressions using two different sets of variables and three cross-validation strategies. We examine the errors in detail to identify the strongest model.

## **DEPENDENT VARIABLE**

In this section, we investigate the dependent variable prior to any filtering.

Our dependent variable is the number of permits issued per census tract. The unit of analysis is permits per census tract because of census data availability and the fact that federal and state funding often flows to areas via census tract or zip code eligibility. We time and or space to see if there there are any patterns to missing permit data so as to better determine which years to test and train on. 

```{r}
# See if there are any patterns to missing spatial data

res_permits %>%
  group_by(year = year(as.Date(permitissuedate))) %>%
  summarise_all(funs(sum(is.na(.)))) %>%
datatable(options = list(pageLength = 12),
            style = "auto")

```

When looking at council district vs. year distribution of permit data, there are plenty of permits in each district, but wide differences by year. Permitting activity in earlier years is much lower than 2020 to 2023. Although 2020 is typically considered an anomaly, permit issuance and development was strong due to low interest rates. Therefore, we trained 2020 and 2021 permit issuance data to test on 2022 development activity. As a result, we got rid of missing spatial data

```{r}
res_permits %>%
  group_by(council_district) %>%
  summarise(count = n()) %>%
  ggplot(aes(y = as.factor(council_district), x = count)) +
  geom_bar(stat = "identity", fill = "#ffb600") + 
  plotTheme() + 
  labs(title = "Permits by Council District", caption = "Figure 1")

res_permits %>%
  group_by(year = year(as.Date(permitissuedate))) %>%
  st_drop_geometry() %>%
  summarise(count = n()) %>%
  ggplot(aes (x = year, y = count)) +
  geom_bar(stat = "identity", fill = "#ffb600") +
  plotTheme() + 
  labs(title = "Permits by Year", caption = "Figure 2")

res_permits <- res_permits %>%
  filter(!is.na(lng) & !is.na(lat)) %>%
  st_as_sf(coords = c("lng","lat"),crs=4326)

```

Next, we examined permit types and counts to understand the distribution of permit types specific to Philadelphia. When looking at the count of permit types, as represented below by n, we can see that the permit type alone does not indicate what type of work is being done. New construction will necessitate different permit types, such as plumbing, mechanical, or electrical permits, but these can also be counted in regular maintenance. Therefore, filtering only based on new construction may be misleading. We looked at census data next to begin understanding how permit data is associated with demographics and our gentrification indicator. 

```{r warning = FALSE, message = FALSE, }
res_permits %>%
  group_by(permitdescription, typeofwork) %>%
  st_drop_geometry() %>%
  summarise(n = n()) %>%
  arrange(permitdescription, desc(n)) %>%
  datatable(options = list(pageLength = 10),
            style = "auto")



```

```{r results = 'hide'}
phl_tracts4326 <- tracts(state = "PA",
                     county = "Philadelphia",
                     year = "2020") %>%
  select(GEOID, geometry) %>%
  st_set_crs(4326)

res_permits2 <- res_permits %>%
  group_by(permitdescription, typeofwork) %>%
  mutate(n = n()) %>%
  ungroup() %>%
  select(permittype, permitdescription, typeofwork, geometry, permitissuedate) %>%
  #join to census tract
  st_join(phl_tracts4326)
```

When we look at number of permits by type per tract as seen below, it becomes apparent there are a very high concentration of all permits in the Fishtown, Port Richmond area. When we look by year, we see that some of this is because of data unavailability in earlier years followed by a flux of permit issuance from 2020 to 2022. 

So what happened these years? 

- Interest rates were at a historical low in 2020 to 2022, in part to entice economic activity to be as enticing as possible after the COVID-19 pandemic brought the economy to a screeching halt. 
![Average interest Rates 2017 - 2023, Source: https://take-profit.org/en/statistics/interest-rate/united-states/](interest-rates.png)

- More locally, Philadelphia passed a tax policy in 2000 to provide a tax abatement on new housing construction until [it expired in 2022](https://www.inquirer.com/news/philadelphia/inq2/philadelphia-property-tax-abatement-neighborhood-maps-data-20230214.html). The policy goal was to spur development for residential property owners by providing a 10-year tax abatement on the value of improvements. This partially explains the time lag of some development, but it remains unclear why the development was so concentrated spatially. 

Our next data analysis step is to drop the year and look only at permits in and after 2018. It is likely that more permits are issued in areas where these changes are already happening. 

```{r}

# PICK 1

tract_permits_year <- res_permits2 %>%
  st_drop_geometry() %>%
  mutate(permit = paste(permitdescription, "-", typeofwork),
         year = year(as.Date(permitissuedate))) %>%
  select(-permitissuedate, -permitdescription, -permittype, -typeofwork) %>%
  gather(Variable, value, -GEOID, -year) %>%
  dplyr::count(Variable, value, GEOID, year) %>%
  group_by(GEOID, year) %>%
  mutate(TOTAL = sum(n)) %>%
  ungroup() %>%
  spread(key = value, value = n) %>%
  select(-Variable) %>%
  replace(is.na(.), 0)

#total permits
tract_permits_year %>%
  left_join(phl_tracts4326) %>%
  st_sf() %>%
  filter(year > 2014 & year < 2023) %>%
  ggplot()+
  geom_sf(aes(fill = cut(TOTAL, breaks = 8))) +
  scale_fill_manual(values = palette1) +
  mapTheme() +
  labs(title = "Permits between 2014 - 2023", caption = "Figure 3")


#by year - there are a lot of tracts missing in earlier years
# tract_permits_year %>%
#   left_join(phl_tracts4326) %>%
#   st_sf() %>%
#   filter(year > 2014 & year <= 2023) %>%
#   ggplot()+
#   geom_sf(aes(fill = TOTAL)) +
#   scale_fill_binned(n.breaks=8) +
#   facet_wrap(~year) +
#   mapTheme()

```



```{r}
tract_permits <- res_permits2 %>%
  st_drop_geometry() %>%
  mutate(permit = paste(permitdescription, "-", typeofwork),
         year = year(as.Date(permitissuedate))) %>%
  filter(year >= 2018) %>%
  select(-permitissuedate, -permitdescription, -permittype, -typeofwork, -year) %>%
  gather(Variable, value, -GEOID) %>%
  dplyr::count(Variable, value, GEOID) %>%
  group_by(GEOID) %>%
  mutate(TOTAL = sum(n)) %>%
  ungroup() %>%
  spread(key = value, value = n) %>%
  select(-Variable) %>%
  replace(is.na(.), 0)

```

## **PERMIT SELECTION**

### Census Data

But a key consideration is: [does demand lead supply](https://journals.sagepub.com/doi/full/10.1177/0042098020940596) or vice versa? Which comes first, development or gentrification?

There are two theories of neighborhood change as it relates to this. 
- 1) Housing demand of middle-class, mostly white residents encourages them to leave neighborhoods to seek out more affordable neighborhoods. 
- 2) Development supply of new construction in historically poorer neighborhoods encourages an influx of new residents, who are encouraged to move to a neighborhood by the supply of housing. 

The above study found that housing market growth (i.e. new permit issuance) did not predict future increase in gentrification; rather, development followed neighborhood change. Gentrification features themselves could be considered indicators of future development. 

As a result of permit analysis, we selected permits that were associated with gentrification indicators. 
Using [prior research on gentrification indicators](https://researchrepository.wvu.edu/cgi/viewcontent.cgi?article=9005&context=etd) and using publicly available census data, we examined the following demographic characteristics: 

- All Census tracts in Philadelphia, 5Y ACS for 2021 and for 2016
- Median HH income, CPI-inflation adjusted
- Median home value, CPI-inflation adjusted
- Median rent, CPI-inflation adjusted
- Median year structure built
- Tenure - owner-occupied
- Number of housing units
- Number of vacancies
- Count of second mortgages
- Percent white population
- Percent with Bachelor's degree

We calculated: rent to income ratio, net changes of multiple indicators from 2017 to 2022, and percent changes from 2017 to 2022. Median rents, values, and incomes were adjusted to 2022 dollars.

```{r results = 'hide'}

CPI2017 <- 242.839
CPI2022 <- 281.148

CPI <- CPI2022/CPI2017

# NOTE: currently replacing NA values with a very small value...

phlCensus22 <- 
  get_acs(geography = "tract", 
          variables = c("B01003_001", "B19013_001", 
                        "B02001_002", "B25077_001",
                        "B25064_001", "B25035_001",
                        "B25003_001", "B25003_002",
                        "B25001_001", "B25002_001",
                        "B25002_003", "B25081_001",
                        "B25081_004", "B15003_001",
                        "B15003_022"
                        ), 
          year = 2022, 
          dataset = "acs5",
          state = "PA", 
          geometry = FALSE, 
          county = c("Philadelphia"),
          output = "wide") %>%
  rename(Total_Pop =  B01003_001E,
         Med_Inc = B19013_001E,
         White_Pop = B02001_002E,
         Med_Value = B25077_001E,
         Med_Rent = B25064_001E,
         Med_Structure = B25035_001E,
         Owner_univ = B25003_001E,
         Owners = B25003_002E,
         Tot_Units = B25001_001E,
         Vac_univ = B25002_001E,
         Vacants = B25002_003E,
         Mort_status_univ = B25081_001E,
         Sec_Mort = B25081_004E,
         Bach_univ = B15003_001E,
         Bach = B15003_022E
         ) %>%
  dplyr::select(Total_Pop, Med_Inc, White_Pop,
                Med_Value, Med_Rent, Med_Structure, 
                Owner_univ, Owners, Tot_Units,
                Vac_univ, Vacants, Mort_status_univ, 
                Sec_Mort, Bach_univ, Bach,  GEOID) %>%
  mutate(Percent_White = White_Pop / Total_Pop,
         Percent_Owner = Owners / Owner_univ,
         Percent_Vacant = Vacants / Vac_univ,
         Percent_2mort = Sec_Mort / Mort_status_univ,
         Percent_bach = Bach / Bach_univ,
         Rent_Income_Ratio = Med_Rent / (Med_Inc/12)
         ) %>%
  #replace(is.na(.), .00000001) %>%
  select(-ends_with("_univ"))

phlCensus17 <- 
  get_acs(geography = "tract", 
          variables = c("B01003_001", "B19013_001", 
                        "B02001_002", "B25077_001",
                        "B25064_001", "B25035_001",
                        "B25003_001", "B25003_002",
                        "B25001_001", "B25002_001",
                        "B25002_003", "B25081_001",
                        "B25081_004", "B15003_001",
                        "B15003_022"
                        ), 
          year = 2017, 
          state = "PA", 
          geometry = FALSE, 
          county=c("Philadelphia"),
          output = "wide") %>%
  rename(Total_Pop =  B01003_001E,
         Med_Inc = B19013_001E,
         White_Pop = B02001_002E,
         Med_Value = B25077_001E,
         Med_Rent = B25064_001E,
         Med_Structure = B25035_001E,
         Owner_univ = B25003_001E,
         Owners = B25003_002E,
         Tot_Units = B25001_001E,
         Vac_univ = B25002_001E,
         Vacants = B25002_003E,
         Mort_status_univ = B25081_001E,
         Sec_Mort = B25081_004E,
         Bach_univ = B15003_001E,
         Bach = B15003_022E
         ) %>%
  dplyr::select(Total_Pop, Med_Inc, White_Pop,
                Med_Value, Med_Rent, Med_Structure, 
                Owner_univ, Owners, Tot_Units,
                Vac_univ, Vacants, Mort_status_univ, 
                Sec_Mort, Bach_univ, Bach, GEOID) %>%
  mutate(
    Med_Rent = Med_Rent * CPI,
    Med_Value = Med_Value * CPI,
    Percent_White = White_Pop / Total_Pop,
         Percent_Owner = Owners / Owner_univ,
         Percent_Vacant = Vacants / Vac_univ,
         Percent_2mort = Sec_Mort / Mort_status_univ,
                  Percent_bach = Bach / Bach_univ,
         Rent_Income_Ratio =  Med_Rent / (Med_Inc/12),
         ) %>%
  #replace(is.na(.), .00000001) %>%
    select(-ends_with("_univ"))

# Interpolate based on population in order to translate 2017 tracts to 2020 tracts
# workaround because tidycensus wasn't downloading geometry

#get just population from phlCensus
pop_22 <- phlCensus22 %>%
  select(GEOID, Total_Pop)

#get tracts; join population to tracts
phl_tracts22 <- tracts(state = "PA",
                     county = "Philadelphia",
                     year = "2022") %>%
  left_join(pop_22) %>%
  st_sf()

#join tracts to Census data
phlCensus22 <- phlCensus22 %>%
  left_join(phl_tracts22 %>%
              select(GEOID, geometry)) %>%
  st_sf()

#get 2017 tracts
phl_tracts17 <- tracts(state = "PA",
                     county = "Philadelphia",
                     year = "2017")

#join tracts to 2017 Census data
phlCensus17 <- phlCensus17 %>%
  left_join(phl_tracts17 %>%
              select(GEOID,geometry)) %>%
  st_sf()

sf_use_s2(FALSE)

census17_interpolate_pw <- interpolate_pw(
  phlCensus17,
  phlCensus22,
  to_id = "GEOID",
  extensive = TRUE, 
  weights = phl_tracts22,
  weight_column = "Total_Pop",
  crs = 4269)

census_22_and_comparison <- phlCensus22 %>%
  left_join(st_drop_geometry(census17_interpolate_pw), 
            by = "GEOID",
            suffix = c("_2022", "_2017")) %>%
  mutate(Total_Pop_NET = Total_Pop_2022 - Total_Pop_2017, 
         Med_Inc_NET = Med_Inc_2022 - Med_Inc_2017, 
         White_Pop_NET = White_Pop_2022 - White_Pop_2017,
         Med_Value_NET = Med_Value_2022 - Med_Value_2017, 
         Med_Rent_NET = Med_Rent_2022 - Med_Rent_2017, 
         Owners_NET = Owners_2022 - Owners_2017, 
         Tot_Units_NET = Tot_Units_2022 - Tot_Units_2017,
         Vacants_NET = Vacants_2022 - Vacants_2017, 
         Sec_Mort_NET = Sec_Mort_2022 - Sec_Mort_2017,
         Percent_White_NET = Percent_White_2022 - Percent_White_2017,
         Percent_Owner_NET = Percent_Owner_2022 - Percent_Owner_2017,
         Percent_Vacant_NET = Percent_Vacant_2022 - Percent_Vacant_2017,
         Percent_2mort_NET = Percent_2mort_2022 - Percent_2mort_2017,
         Rent_Income_Ratio_NET = Rent_Income_Ratio_2022 - Rent_Income_Ratio_2017,
         Bach_NET = Bach_2022 - Bach_2017,
         Percent_bach_NET = Percent_bach_2022 - Percent_bach_2017,
         
         Total_Pop_PCT = ifelse(Total_Pop_2017>0, (Total_Pop_NET / Total_Pop_2017), NA)*100, 
         Med_Inc_PCT = ifelse(Med_Inc_2017 >0, (Med_Inc_NET / Med_Inc_2017), NA)*100, 
         White_Pop_PCT = ifelse(White_Pop_2017 >0, (White_Pop_NET / White_Pop_2017), NA)*100,
         Med_Value_PCT = ifelse(Med_Value_2017>0, (Med_Value_NET / Med_Value_2017), NA)*100, 
         Med_Rent_PCT = ifelse(Med_Rent_2017>0, (Med_Rent_NET / Med_Rent_2017),NA)*100, 
         Owners_PCT = ifelse(Owners_2017>0, ( Owners_NET / Owners_2017),NA)*100, 
         Tot_Units_PCT = ifelse(Tot_Units_2017>0, (Tot_Units_NET / Tot_Units_2017),NA)*100,
         Vacants_PCT = ifelse(Vacants_2017>0,(Vacants_NET / Vacants_2017),NA)*100, 
         Sec_Mort_PCT = ifelse(Sec_Mort_2017>0,(Sec_Mort_NET / Sec_Mort_2017),NA)*100,
         Percent_White_PCT =ifelse(Percent_White_2017>0,(Percent_White_NET / Percent_White_2017),NA)*100,
         
         # is it bad to calculate percent change using percentages?
         
         Percent_Owner_PCT = ifelse(Percent_Owner_2017>0,(Percent_Owner_NET / Percent_Owner_2017),NA)*100,
         Percent_Vacant_PCT = ifelse(Percent_Vacant_2017>0,(Percent_Vacant_NET / Percent_Vacant_2017),NA)*100,
         Percent_2mort_PCT = ifelse(Percent_2mort_2017>0,(Percent_2mort_NET / Percent_2mort_2017),NA)*100,
         
         Rent_Income_Ratio_PCT = ifelse(Rent_Income_Ratio_2017>0, (Rent_Income_Ratio_NET / Rent_Income_Ratio_2017),NA)*100,
         
         Percent_bach_PCT = ifelse(Percent_bach_2017>0, (Percent_bach_NET / Percent_bach_2017),NA)*100) %>%
  
  relocate(GEOID)

```

### Gentrification Indicator

We developed a gentrification indicator methodology based on [a Drexel study](https://drexel.edu/uhc/resources/briefs/Measure-of-Gentrification-for-Use-in-Longitudinal-Public-Health-Studies-in-the-US/). This is just one possible indicator and could be improved upon or changed by incorporating additional demographic variables. 

These four gentrification categories we use to characterize tracts are:

* Ineligible
* Not gentrifying
* Gentrifying
* Intensely Gentrifying

We can then see for which characterizations there are differences in permit types. If the mean number of permits for census tracts that are gentrifying or intensely gentrifying is larger than for tracts that are ineligible or not gentrifying, we keep that permit type. 


```{r results = 'hide'}

indicator <- census_22_and_comparison %>%
  mutate(indicator = 
           case_when(
             
             # tracts with < 50 people or in top quartile of median income are excluded
             
             Total_Pop_2022 <= 50 | Med_Inc_2017 > quantile(Med_Inc_2017, .75, na.rm = TRUE) ~ "Ineligible",
         
             # tracts with below median increase in pct with bachelor's degree,
             # below median changes in rent and home values are not gentrifying
             Percent_bach_PCT < median(Percent_bach_NET, na.rm = TRUE) | 
               (Med_Value_NET < median(Med_Value_NET, na.rm = TRUE) &
                  Med_Rent_NET < median(Med_Rent_NET, na.rm = TRUE)) ~ "Not gentrifying",
         
            # tracts with above-median percent increases in bachelor degree attainment
            # and EITHER b/t .5 and .75 percentile increase in home values or rents = gentrifying
         
         Percent_bach_PCT >= median(Percent_bach_NET, na.rm = TRUE) & (
           (Med_Value_NET >= quantile(Med_Value_NET, .50, na.rm = TRUE) &
              Med_Value_NET < quantile(Med_Value_NET, .75, na.rm = TRUE)) |
             (Med_Rent_NET >= quantile(Med_Rent_NET, .50, na.rm = TRUE) & 
                Med_Rent_NET < quantile(Med_Rent_NET, .75, na.rm = TRUE))
           
         ) ~ "Gentrifying",
         
         # tracts with above-median percent increases in bachelor degree attainment
            # and EITHER in .75 percentile increase in home values or rents = intensely gentrifying
         
         Percent_bach_PCT >= median(Percent_bach_NET, na.rm = TRUE) & 
           (Med_Value_NET >= quantile(Med_Value_NET,.75, na.rm = TRUE) | 
              Med_Rent_NET >= quantile(Med_Rent_NET,.75, na.rm = TRUE)
           )
          
          ~ "Intensely gentrifying")
  ) %>%
  
  select(GEOID, indicator, Total_Pop_2022, Total_Pop_2017, 
         Med_Inc_2022, Med_Inc_2017,
         Percent_bach_NET, Med_Value_NET, Med_Rent_NET) 

```

### Permit Selection

We can see below how each permit type relates to each gentrification indicator, and we can also calculate the net difference and percent difference between the number of permits in not gentrifying tracts versus intensely gentrifying tracts. 

```{r}
indicator_table <- indicator %>% 
    dplyr::select(-Total_Pop_2022, -Total_Pop_2017, 
         -Med_Inc_2022, -Med_Inc_2017,
         -Percent_bach_NET, -Med_Value_NET, -Med_Rent_NET) %>%
    left_join(tract_permits) %>%
  st_drop_geometry() %>%
  select(-GEOID) %>%
  gather(Variable, value, -indicator) %>%
  group_by(indicator, Variable) %>%
  summarise_if(is.numeric, mean, na.rm = TRUE) %>%
  ungroup() %>%
  spread(key = indicator, value = value) %>%
  mutate_if(is.numeric, round, digits = 3) %>%
  select(Variable, Ineligible, `Not gentrifying`, Gentrifying, `Intensely gentrifying`) 

keep <- indicator_table %>%
  mutate(g_to_non_g_PCT = round((Gentrifying - `Not gentrifying`) / `Not gentrifying`, 4) *100,
         ig_to_non_g_PCT = round((`Intensely gentrifying` - `Not gentrifying`) / `Not gentrifying`, 4) *100) %>%
  filter(Gentrifying >= 1 & `Intensely gentrifying`>=1 & g_to_non_g_PCT > 0 & ig_to_non_g_PCT > 0) %>%
  select(-g_to_non_g_PCT, -ig_to_non_g_PCT)

keep %>%
  datatable()

keep <- unique(keep$Variable) 

```

Here, we filter only for permits where there is a non-zero percent difference between intensely gentrifying tracts and non-gentrifying tracts, and keep only those permits and attach that to census data, so each tract is assigned an indicator for whether or not it is gentrifying. 

```{r, results = 'hide'}

tract_permits_final <- res_permits2 %>%
  st_drop_geometry() %>%
  mutate(permit = paste(permitdescription, "-", typeofwork),
         year = year(as.Date(permitissuedate))) %>%
  filter(year >= 2018 & permit %in% keep) %>%
  select(-permitissuedate, -permitdescription, -permittype, -typeofwork) %>%
  gather(Variable, value, -GEOID, -year) %>%
  dplyr::count(Variable, value, GEOID, year) %>%
  group_by(GEOID, year) %>%
  mutate(TOTAL = sum(n)) %>%
  ungroup() %>%
  spread(key = value, value = n) %>%
  select(GEOID, year, TOTAL) %>%
  replace(is.na(.), 0) 

census_permits <- 
  expand.grid(GEOID = unique(tract_permits_final$GEOID),
              year=unique(tract_permits_final$year)) %>%
  left_join(indicator %>% select(GEOID, indicator)) %>%
  left_join(tract_permits_final) %>%
  left_join(census_22_and_comparison %>% st_drop_geometry()) %>%
  mutate(TOTAL = ifelse(is.na(TOTAL), 0, TOTAL)) %>%
  filter(Total_Pop_2022 > 0) %>% 
  select(-geometry) %>%
  left_join(phl_tracts4326) %>%
  st_as_sf()
  
census_permits$indicator <- factor(census_permits$indicator, ordered = TRUE, levels = c("Ineligible", "Not gentrifying", "Gentrifying", "Intensely gentrifying"))

```

### Exploratory Analysis of Filtered Dependent Variable 

Now that we have filtered what will ultimately become our dependent variable, we can see that there are *many* census tracts where there are no permits issued across all years. 

```{r}

ggplot(census_permits) +
  geom_histogram(aes(x = TOTAL), bins = 40, fill = "#ffb600") +
  labs(title = "Permit Distribution", caption = "Figure 4")

```

There was a huge spike of permits issued in 2020 citywide, likely spurred by the announced [rollback of the tax abatement period](https://www2.law.temple.edu/10q/philadelphia-city-council-enacts-sweeping-changes-to-the-tax-abatement-program/#:~:text=In%20Philadelphia%2C%20residential%20properties%20are,filed%20after%20December%2021%2C%202020.) and the historically low interest rates. 
As a result, [2021 was a major permit boom year](https://www.inquirer.com/real-estate/housing/housing-development-permits-abatement-philadelphia-decline-20221207.html) and [December 2021 experienced the the highest count of new construction permits issued since January 2016](https://www.economyleague.org/resources/inside-philadelphias-imminent-construction-boom). The [boom continued in 2022](https://whyy.org/articles/philly-is-on-track-to-build-a-record-number-of-apartments-in-2022-analyst-says/) and news articles throughout the years commented on the ["absurd" number of new consruction permits issued](https://www.bisnow.com/philadelphia/news/construction-development/philadelphia-construction-boom-2021-tax-abatement-111874) in Philadelphia. 


```{r}
census_permits %>%
  group_by(year) %>%
  summarise(TOTAL = sum(TOTAL)) %>%
  filter(!is.na(year)) %>%
  st_drop_geometry() %>%
  ggplot(aes (x = year, y = TOTAL)) +
  geom_bar(stat = "identity", fill = "#ffb600") +
  plotTheme() +
  labs(title = "Permits by Year", caption = "Figure 5")
```

National and city-wide policies may be influencing development on a city scale. Gentrification indicators provide a clue as to which tracts are experiencing change, or very rapid change, if at all. But we haven't yet honed in on neighborhoods to guide our feature engineering. We looked at permits by neighborhood for some clues and could immediately see that there were high permits issued for neighborhoods that have been the subject of locla articles about gentrification, including [Fishtown](https://www.forbes.com/sites/petertaylor/2018/05/02/how-fishtown-philadelphia-became-americas-hottest-new-neighborhood/?sh=14965d8732e5), [Point Breeze](https://www.inquirer.com/news/point-breeze-tioga-nicetown-gentrification-philadelphia-census-neighborhood-change-20200109.html), and [Lower Kensington](https://www.inquirer.com/real-estate/inga-saffron/kensington-gentrification-shift-capital-drugs-opioids-impact-services-neighborhood-development-housing-maken-fels-fund-inga-saffron-20210210.html)

```{r}
res_permits3 <- res_permits %>%
  group_by(permitdescription, typeofwork) %>%
  mutate(n = n()) %>%
  ungroup() %>%
  select(permittype, permitdescription, typeofwork, geometry, permitissuedate) %>%
  st_join(phl.nh) %>%
    mutate(permit = paste(permitdescription, "-", typeofwork),
         year = year(as.Date(permitissuedate))) %>%
  filter(year>=2018 & permit %in% keep)

res_permits3 <- res_permits3 %>%
  st_drop_geometry() %>%
  group_by(mapname, year) %>%
  summarise(n = n()) %>%
  arrange(desc(n))

res_permits3 %>%
  datatable()
```

When we look at this over time, we can see clearly that the neighborhood patterns remain the same, with the same neighborhoods receiving permits across a 6-year period. 

```{r}

res_permits3 <- res_permits3 %>%
  left_join(phl.nh)

res_permits3 %>%
  ungroup() %>%
  st_as_sf() %>%
  ggplot() +
  geom_sf(aes(fill = cut(n, breaks = 7))) +
  scale_fill_manual(values = palette1) +
  facet_wrap(~year) +
  mapTheme() +
  labs(title = "Permits by Year, Philadelphia", caption = "Figure 6")


```

Now that we have explored gentrification indicators across Philadelphia neighborhoods and tracts, permits issued over time and space, and some basic demographic characteristics, we developed a composite map where we can see which neighborhoods have gentrification indicators and are seeing a lot of new development for a given year. *Click through the years to look at how gentrification indicators and permits change over time.*

In 2021, there are area that are gentrifying but not seeing a lot of new development - particularly in the Northeast and Northwest. Predicting development is therefore going to be slightly different than predicting changes in rents, home values, and other demographic characteristics like racial makeup. 

```{r}
bins = c(0,5,30,50,100,200,300,Inf)

permits18 <- colorBin(
  palette = "viridis",
  bins = bins,
  domain = census_permits %>% filter(year == 2018) %>% .$TOTAL)

permits19 <- colorBin(
  palette = "viridis",
  bins = bins,
  domain = census_permits %>% filter(year == 2019) %>% .$TOTAL)

permits20 <- colorBin(
  palette = "viridis",
  bins = bins,
  domain = census_permits %>% filter(year == 2020) %>% .$TOTAL)

permits21 <- colorBin(
  palette = "viridis",
  bins = bins,
  domain = census_permits %>% filter(year == 2021) %>% .$TOTAL)

permits22 <- colorBin(
  palette = "viridis",
  bins = bins,
  domain = census_permits %>% filter(year == 2022) %>% .$TOTAL)

permits23 <- colorBin(
  palette = "viridis",
  bins = bins,
  domain = census_permits %>% filter(year == 2023) %>% .$TOTAL)

indicator_pal <- colorFactor(palette = c("#c191d9", "#6d5480","#ff66f0", "#fccd32"),
                         domain = census_permits$indicator)

permits_map <- leaflet() %>%
  addProviderTiles(providers$CartoDB.Positron) %>%
  addPolygons(data = census_permits %>% filter(year == 2023),
              stroke = TRUE,
              color = ~indicator_pal(indicator),
              weight = 2,
              opacity = 1,
              fillOpacity = .5,
              fillColor = ~permits23(TOTAL),
              label = paste("Total Permits:", census_permits %>% filter(year == 2023) %>% .$TOTAL),
              group = "2023") %>%
      addPolygons(data = census_permits %>% filter(year == 2022),
              stroke = TRUE,
              color = ~indicator_pal(indicator),
              weight = 2,
              opacity = 1,
              fillOpacity = .5,
              fillColor = ~permits22(TOTAL),
              label = paste("Total Permits:", census_permits %>% filter(year == 2022) %>% .$TOTAL),
              group = "2022") %>%
  addPolygons(data = census_permits %>% filter(year == 2021),
              stroke = TRUE,
              color = ~indicator_pal(indicator),
              weight = 2,
              opacity = 1,
              fillOpacity = .5,
              fillColor = ~permits21(TOTAL),
              label = paste("Total Permits:", census_permits %>% filter(year == 2021) %>% .$TOTAL),
              group = "2021") %>%
  addPolygons(data = census_permits %>% filter(year == 2020),
              stroke = TRUE,
              color = ~indicator_pal(indicator),
              weight = 2,
              opacity = 1,
              fillOpacity = .5,
              fillColor = ~permits20(TOTAL),
              label = paste("Total Permits:", census_permits %>% filter(year == 2020) %>% .$TOTAL),
              group = "2020") %>%
    addPolygons(data = census_permits %>% filter(year == 2019),
              stroke = TRUE,
              color = ~indicator_pal(indicator),
              weight = 2,
              opacity = 1,
              fillOpacity = .5,
              fillColor = ~permits19(TOTAL),
              label = paste("Total Permits:", census_permits %>% filter(year == 2019) %>% .$TOTAL),
              group = "2019") %>%
  addPolygons(data = census_permits %>% filter(year == 2018),
              stroke = TRUE,
              color = ~indicator_pal(indicator),
              weight = 2,
              opacity = 1,
              fillOpacity = .5,
              fillColor = ~permits18(TOTAL),
              group = "2018",
              label = paste("Total Permits:", census_permits %>% filter(year == 2018) %>% .$TOTAL)) %>%
  addLegend(data = census_permits,
    pal = indicator_pal,
            value = ~indicator,
    title = "Gentrification Indicator",
    position = "bottomleft") %>%
   addLayersControl(
    baseGroups = c("2023", "2022", "2021", "2020", "2019", "2018"),
    options = layersControlOptions(collapsed = FALSE)) 

permits_map %>%
  setView("-75.139446","39.996849",zoom = 11)

```

When we just look at our gentrification indicator alone (without permits), we can see that not all development is a result of gentrification, but there are still some areas where the two overlap. Although this project aims to provide targeted areas for Housing Trust Funds, we have to recognize that permits alone do not perfectly align with gentrification, but merely reflect new development.

```{r}
census_permits %>%
  filter(!is.na(year)) %>%
  st_sf() %>%
  ggplot()+
  geom_sf(aes(fill = indicator)) +
  scale_fill_manual(values = palette1) +
  mapTheme() +
  labs(title = "Gentrification Indicators", caption = "Figure 7")
```

## **FEATURE ENGINEERING**

### Spatial and spatial time lags, Census data

Developing spatial and spatial time lags can help us capture the spatial endogeneity of gentrification as discussed [here](https://urbanspatialanalysis.com/portfolio/predicting-gentrification-using-longitudinal-census-data/#:~:text=Endogenous%20gentrification&text=Because%20areas%20in%20close%20proximity,but%20with%20lower%20housing%20costs). We created a spatial weights matrix based on contiguity relationships of the census tracts sharing edges and vertices. Specifically, we looked at the nearest neighbor for prices, rents, and demographics of neighboring census tracts as possible key variables to predict for permit development. 

We also created a lag for permits issued to last year's neighbors. Spatial lag and spatial time lag has major predictive potential for our modeling efforts. The average permit count for the nearest census tract can be helpful, but utilizing the nearest neighbor for social demographic data has the potential to be more influence. As high-demand neighborhoods become more price unattainable, residents priced out try to remain close and choose neighborhoods with some adjacent characteristics such as bachelors degerees, income, or white residents, and therefore more likelihood of higher social integration. 

```{r}
nb <- st_contiguity(census_permits$geometry, queen = TRUE)
wt <- st_weights(nb)

census_permits <- census_permits %>%
  mutate(nn_mean_rent_22 = st_lag(Med_Rent_2022,nb,wt, na_ok = TRUE),
      nn_mean_price_22 = st_lag(Med_Value_2022, nb,wt, na_ok = TRUE),
      nn_mean_rent_17 = st_lag(Med_Rent_2017,nb,wt, na_ok = TRUE),
      nn_mean_price_17 = st_lag(Med_Value_2017, nb,wt, na_ok = TRUE),
      nn_bach_22 = st_lag(Percent_bach_2022, nb,wt, na_ok = TRUE),
      nn_bach_17 = st_lag(Percent_bach_2017, nb,wt, na_ok = TRUE),
      nn_income_22 = st_lag(Med_Inc_2022, nb,wt, na_ok = TRUE),
      nn_income_17 = st_lag(Med_Inc_2017, nb,wt, na_ok = TRUE),
      nn_pct_white_22 = st_lag(Percent_White_2022, nb,wt, na_ok = TRUE),
      nn_pct_white_17 = st_lag(Percent_White_2017, nb,wt, na_ok = TRUE)
      )

census_permits <- census_permits %>%
  group_by(GEOID) %>%
  arrange(GEOID, year) %>%
  mutate(last_year_TOTAL = dplyr::lag(TOTAL),
         last_year_TOTAL = ifelse(is.na(last_year_TOTAL), TOTAL, last_year_TOTAL)) %>%
  ungroup() %>%
  mutate(nn_LY_permits = st_lag(last_year_TOTAL, nb, wt))

```

### Local Moran's I

Moran's I can help identify spatial autocorrelation, and similarity among clusters of values. Moran's I values throughout Philadelphia are largely above 1, indicating statistically significant clusters wherein nearest neighbors are having an impact on permit count. Below, we can see that that most of Philadelphia has very low Moran's I values except for Fishtown and University City, which are outliers. Similarly, there are higher probability values in the areas surrounding Fishtown and University City, particularly in West Philly. 

Higher probability for the nearest neighborhoods surrounding the high demand neighborhoods suggest that there is some uncertainty regarding prediction for these tracts. Although this generally indicates that our confidence for predicting these tracts is lower, the clustering of these probability values suggests that the nearest neighbor and spatial lag is indicative of some factor, even if we are not identifying it with exactness.

```{r}

nb <- poly2nb(census_permits$geometry, queen=TRUE)
wt <- nb2listw(nb, style="W", zero.policy=TRUE)

local_morans <- localmoran(census_permits$TOTAL, wt, zero.policy=TRUE) %>%
  as.data.frame() %>%
  rename(c("Local_Morans_I" = "Ii", "P_Value" = "Pr(z != E(Ii))"))

census_permits <- 
  cbind(census_permits, local_morans %>% select(Local_Morans_I, P_Value)) %>% 
  st_sf() %>%
  mutate(Significant_Hotspots = ifelse(P_Value <= 0.001, 1, 0)) 

#Plot Morans I

localMorans <- 
  cbind(local_morans, census_permits) %>% 
  st_sf() %>%
  dplyr::select(TOTAL,
                Local_Morans_I,
                P_Value) %>%
  mutate(Significant_Hotspots = ifelse(P_Value <= 0.001, 1, 0)) %>%
  gather(Variable, Value, -geometry)

## This is just for plotting
vars <- unique(localMorans$Variable)
varList <- list()

for(i in vars){
  varList[[i]] <- 
    ggplot() +
      geom_sf(data = filter(localMorans, Variable == i), 
              aes(fill = Value), colour=NA) +
      scale_fill_viridis_b(name="") +
      labs(title=i) +
      mapTheme(title_size = 14) + theme(legend.position="bottom")
}


do.call(grid.arrange,c(varList, ncol = 4, top = "Local Morans I statistics, Total Permits"))
  
```

### **Additional Feature Engineering**

#### Coffee shops

Finally, we generated a few amenity features such as stores or infrastructure that may make a neighborhood more attractive. These include cafes from 2018 to 2023 since the presence of new cafes may be correlated with gentrification or new development. 

```{r results = 'hide'}
#coffee shops by year, then join to census_permits.

 # cafes18 <- getbb('Philadelphia, PA') %>%
 #   opq(datetime = '2018-01-01T00:00:01Z') %>%
 #   add_osm_feature(key = "amenity", value = "cafe") %>%
 #   osmdata_sf()
 # 
 # cafes18.sf <- st_as_sf(cafes18$osm_points)
 # 
 # cafes18.sf <- cafes18.sf %>%
 #   st_transform(crs = st_crs(census_permits)) %>%
 #   st_join(census_permits, join = st_within)
 # 
 # cafes18.sf <- cafes18.sf %>%
 #   filter(!is.na(GEOID)) %>%
 #   group_by(GEOID) %>%
 #   summarise(cafe_count = n()) %>%
 #   mutate(cafeyear = 2018)
 # 
 # cafes19 <- getbb('Philadelphia, PA') %>%
 #   opq(datetime = '2019-01-01T00:00:01Z') %>%
 #   add_osm_feature(key = "amenity", value = "cafe") %>%
 #   osmdata_sf()
 # 
 # cafes19.sf <- st_as_sf(cafes19$osm_points) %>%
 #   st_transform(crs = st_crs(census_permits)) %>%
 #      st_join(census_permits, join = st_within) %>%
 #   filter(!is.na(GEOID)) %>%
 #   group_by(GEOID) %>%
 #   summarise(cafe_count = n()) %>%
 #   mutate(cafeyear = 2019)
 # 
 # cafes20 <- getbb('Philadelphia, PA') %>%
 #   opq(datetime = '2020-01-01T00:00:01Z') %>%
 #   add_osm_feature(key = "amenity", value = "cafe") %>%
 #   osmdata_sf()
 # 
 # cafes20.sf <- st_as_sf(cafes20$osm_points) %>%
 #   st_transform(crs = st_crs(census_permits)) %>%
 #   st_join(census_permits, join = st_within) %>%
 #   filter(!is.na(GEOID)) %>%
 #   group_by(GEOID) %>%
 #   summarise(cafe_count = n()) %>%
 #   mutate(cafeyear = 2020)
 # 
 # cafes21 <- getbb('Philadelphia, PA') %>%
 #   opq(datetime = '2021-01-01T00:00:01Z') %>%
 #   add_osm_feature(key = "amenity", value = "cafe") %>%
 #   osmdata_sf()
 # 
 #  cafes21.sf <- st_as_sf(cafes21$osm_points) %>%
 #   st_transform(crs = st_crs(census_permits)) %>%
 #   st_join(census_permits, join = st_within) %>%
 #   filter(!is.na(GEOID)) %>%
 #   group_by(GEOID) %>%
 #   summarise(cafe_count = n()) %>%
 #   mutate(cafeyear = 2021)
 # 
 # cafes22 <- getbb('Philadelphia, PA') %>%
 #   opq(datetime = '2022-01-01T00:00:01Z') %>%
 #   add_osm_feature(key = "amenity", value = "cafe") %>%
 #   osmdata_sf()
 # 
 # cafes22.sf <- st_as_sf(cafes22$osm_points) %>%
 #   st_transform(crs = st_crs(census_permits)) %>%
 #   st_join(census_permits, join = st_within) %>%
 #   filter(!is.na(GEOID)) %>%
 #   group_by(GEOID) %>%
 #   summarise(cafe_count = n()) %>%
 #   mutate(cafeyear = 2022)
 # 
 # cafes23 <- getbb('Philadelphia, PA') %>%
 #   opq(datetime = '2023-01-01T00:00:01Z') %>%
 #   add_osm_feature(key = "amenity", value = "cafe") %>%
 #   osmdata_sf()
 # 
 # cafes23.sf <- st_as_sf(cafes23$osm_points) %>%
 #   st_transform(crs = st_crs(census_permits)) %>%
 #   st_join(census_permits, join = st_within) %>%
 #   filter(!is.na(GEOID)) %>%
 #   group_by(GEOID) %>%
 #   summarise(cafe_count = n()) %>%
 #   mutate(cafeyear = 2023)
 # 
 #  cafes_list <- list(cafes18.sf, cafes19.sf, cafes20.sf, cafes21.sf, cafes22.sf,
 #                     cafes23.sf)
 # 
 # all_cafes <- reduce(cafes_list, bind_rows) %>%
 #   rename("year" = "cafeyear") %>%
 #  st_drop_geometry()
 # 
 # 
 # write.csv(all_cafes, "Data/all_cafes.csv")
 # 

# rm(cafe18, cafes18.sf, cafes19, cafes19.sf, cafes20, cafes20.sf, cafes21, cafes21.sf, cafes22, cafes22.sf, cafes23, cafes23.sf)

 all_cafes <- read.csv("Data/all_cafes.csv") %>%
   select(-X)

all_cafes$GEOID <-  as.character(all_cafes$GEOID)


  census_permits <- census_permits %>%
    left_join(all_cafes, by = c("GEOID", "year")) %>%
     mutate(cafe_count = ifelse(is.na(cafe_count), 0, cafe_count))

```

#### Distance to transit stations

We also looked at proximity to fixed public transportation within the city of Philadelphia, including regional rail, the Broad Street Line, Market Frankford Line, and trolley lines. 

```{r results = 'hide', message = FALSE, warning = FALSE}
regionalrail <- st_read("Data/Regional_Rail_Stations.geojson") 

trolleys <- st_read("Data/Trolley_Stations.geojson") %>%
  rename(Line_Name = LineAbbr,
         Latitude = Lat,
         Longitude = Lon,
         Station_Na = StopName) %>%
  select(FID, Line_Name, Station_Na, Latitude, Longitude, geometry)
  
metro <- st_read("Data/Highspeed_Stations.geojson") %>%
  rename(Line_Name = Route, 
         Station_Na = Station)

transit <- rbind(regionalrail, trolleys, metro) %>%
    st_transform(crs = st_crs(phl_tracts22))


phl_transit <- st_join(transit, phl_tracts22, join = st_within) %>%
  filter(!is.na(GEOID)) %>%
  select(Line_Name, Station_Na, Latitude, Longitude, GEOID, geometry) %>%
  st_transform(crs = st_crs(census_permits))

census_permits <- census_permits %>%
  mutate(nearest_transit = nn_function(st_coordinates(st_centroid(census_permits)), st_coordinates(phl_transit), k=1)*69.2)
```

#### Number of jobs within tract

Finally, we looked at job counts per tract. Although we initially were going to look at places such as major universities or distance to center city, ultimately, we recognized that these locations are substitutes for job density. We looked at jobs over time, but overall the proportion remains pretty similar across the years for which we have data. Therefore, we [interpolate](https://walker-data.com/tidycensus/reference/interpolate_pw.html
https://walker-data.com/tidycensus/reference/interpolate_pw.html) the geographies to 2020 tracts and then look at spatial lag -- jobs in surrounding census tracts. This feature could use some further development, because our visual assessment is that people do not want to live in the areas with the highest job density, but a within a certain distance of those job centers. 

```{r}
#use Lehdr package to bring in jobs and to analyze where people are going by work and home census tract. 
#derived from: https://github.com/jamgreen/lehdr 

#pa jobs from LEHDR
#match with lodes data from previous year, grab data for 2019 - 2022
#match the individual permit sheet with the previous years job data
# 
# pa_jobs19 <- grab_lodes(state = "pa",
#                        year = 2019,
#                        version = "LODES8",
#                        lodes_type = "od",
#                        job_type = "JT01",
#                        segment = "S000",
#                        state_part = "main",
#                        agg_geo = "tract") %>%
#   select(-h_tract)
# 
# pa_jobs20 <- grab_lodes(state = "pa",
#                        year = 2020,
#                        version = "LODES8",
#                        lodes_type = "od",
#                        job_type = "JT01",
#                        segment = "S000",
#                        state_part = "main",
#                        agg_geo = "tract")%>%
#   select(-h_tract)
# 
# pa_jobs21 <- grab_lodes(state = "pa",
#                        year = 2021,
#                        version = "LODES8",
#                        lodes_type = "od",
#                        job_type = "JT01",
#                        segment = "S000",
#                        state_part = "main",
#                        agg_geo = "tract")%>%
#   select(-h_tract)
# 
# #filter for only jobs that start or end in Philadelphia tracts
# 
# phl_jobs19 <- pa_jobs19 %>%
#   left_join(phl_tracts22, by = c("w_tract" = "GEOID")) %>%
#     filter(!st_is_empty(geometry)) %>%
#   st_as_sf() %>%
#   rename(GEOID = w_tract) %>%
#   group_by(GEOID) %>%
#   summarise(job_count = n())
# 
# phl_jobs19 <- interpolate_pw(
#   phl_jobs19,
#   phl_tracts22,
#   to_id = "GEOID",
#   extensive = TRUE, 
#   weights = phl_tracts22,
#   weight_column = "Total_Pop",
#   crs = 4269)
# 
# phl_jobs20 <- pa_jobs20 %>%
#   left_join(phl_tracts22, by = c("w_tract" = "GEOID")) %>%
#     filter(!st_is_empty(geometry)) %>%
#   st_as_sf() %>%
#   rename(GEOID = w_tract) %>%
#   group_by(GEOID) %>%
#   summarise(job_count = n())
# 
# phl_jobs21 <- pa_jobs21 %>%
#   left_join(phl_tracts22, by = c("w_tract" = "GEOID")) %>%
#     filter(!st_is_empty(geometry)) %>%
#   st_as_sf() %>%
#   rename(GEOID = w_tract) %>%
#   group_by(GEOID) %>%
#   summarise(job_count = n())
# 
# phl_jobs_combined <- rbind(cbind(phl_jobs19, year = 2019),
#                            cbind(phl_jobs20, year = 2020),
#                            cbind(phl_jobs21, year = 2021))
# 
# ggplot(phl_jobs_combined) +
#   geom_sf(aes(fill = job_count)) +
#   labs(title = "Job Distribution in Philadelphia", color = "Job Count") +
#   mapTheme() +
#   facet_wrap(~year, ncol = 3)
# 
# phl_jobs_combined$year <- as.Numeric(phl_jobs_combined$year)

phl_jobs_combined <- read.csv("Data/phl_jobs.csv") %>%
  select(-X) %>%
  mutate(GEOID = as.character(GEOID)) %>%
  rename("year_jobs" = "year")

census_permits <- census_permits %>%
  mutate(year_jobs = case_when(year == 2018 ~ 2019,
                               year == 2022 ~ 2021,
                             year == 2023 ~ 2021,
                             year %in% c(2019,2020,2021) ~ year)) %>%
  left_join(phl_jobs_combined, 
            by = c("GEOID","year_jobs")) %>%
  select(-year_jobs)

phl_jobs_combined <- phl_jobs_combined %>%
  left_join(phl_tracts22, by = c("GEOID" = "GEOID")) %>%
  select(job_count, GEOID, geometry) %>%
  st_as_sf()

nb <- st_contiguity(census_permits$geometry, queen = TRUE)
wt <- st_weights(nb)

census_permits <- census_permits %>%
   mutate(jobs.nn= st_lag(job_count,nb,wt, na_ok = TRUE))
  
```

#### Neighborhoods

```{r results = 'hide'}
census_permits <- st_join(st_centroid(census_permits), phl.nh, join = st_within) %>%
  st_drop_geometry() %>%
  left_join(phl_tracts4326) %>%
  st_as_sf()
```

### **EXPLORATORY ANALYSIS**

Here, we look at our dependent variable, total permits, as a function of our numeric variables. Last year's permits and the Local Moran's I are most strongly correlated. 

Some of these graphs show that permits spike for values in the middle of the distribution. As an experiment, we could make a binary variable for relationships where there's a spike in the middle - between the 45th and 55th percentile or not. 

```{r, fig.height=30, fig.width = 10}

#filter the outliers 
census_permits_test <- census_permits %>%
  filter(TOTAL < 600) %>%
  dplyr::select(-indicator, -mapname, -name, -Significant_Hotspots) %>%
  gather(variable, value, -geometry, -year, -GEOID, -TOTAL)

ggplot(census_permits_test, aes(value, TOTAL)) +
  geom_point() +
  stat_smooth(aes(value, TOTAL), 
              method = "lm", se = FALSE, size = 1, colour="#ffb600") +
  facet_wrap(~variable, scales= "free", ncol=4) +
  labs(title = "Exploratory Analysis", caption = "Figure 8")

```

We can test this theory out on the variables that appear to have the largest spikes in the middle. 
It turns out this is only true for three variables -- the middle percentile (between 45th and 55th percentile) is related to the number of permits for the change in percentage of homes with a second mortgage between 2017 and 2022 5-Year ACS surveys, for 2017 adjusted mean rent, and for percent owner in 2022.

```{r}
census_permits_test <- census_permits %>%
  mutate(
    middle_bach_17 = ifelse(Bach_2017 > quantile(Bach_2017, .45, na.rm = TRUE) & Bach_2017 <= quantile(Bach_2017, .55, na.rm = TRUE), TRUE, FALSE),
    
    #maybe this one
         middle_rent_17 = ifelse(Med_Rent_2017 > quantile(Med_Rent_2017, .45, na.rm = TRUE) & Med_Rent_2017 <= quantile(Med_Rent_2017, .55, na.rm = TRUE), TRUE, FALSE),
    
         middle_price_NET = ifelse(Med_Value_NET > quantile(Med_Value_NET, .45, na.rm = TRUE) & Med_Value_NET <= quantile(Med_Value_NET, .55, na.rm = TRUE), TRUE, FALSE),
    
         middle_nn_bach_22 = ifelse(nn_bach_22 > quantile(nn_bach_22, .45, na.rm = TRUE) & nn_bach_22 <= quantile(nn_bach_22, .55, na.rm = TRUE), TRUE, FALSE),
    
         middle_nn_income_22 = ifelse(nn_income_22 > quantile(nn_income_22, .45, na.rm = TRUE) & nn_income_22 <= quantile(nn_income_22, .55, na.rm = TRUE), TRUE, FALSE),
    
         middle_nn_mean_price_22 = ifelse(nn_mean_price_22 > quantile(nn_mean_price_22, .45, na.rm = TRUE) & nn_mean_price_22 <= quantile(nn_mean_price_22, .55, na.rm = TRUE), TRUE, FALSE),
    
         middle_nn_mean_rent_22 = ifelse(nn_mean_rent_22 > quantile(nn_mean_rent_22, .45, na.rm = TRUE) & nn_mean_rent_22 <= quantile(nn_mean_rent_22, .55, na.rm = TRUE), TRUE, FALSE),
    
         middle_Owners_PCT = ifelse(Owners_PCT > quantile(Owners_PCT, .45, na.rm = TRUE) & Owners_PCT <= quantile(Owners_PCT, .55, na.rm = TRUE), TRUE, FALSE),
    
    #this one
         middle_Percent_2mort_NET = ifelse(Percent_2mort_NET > quantile(Percent_2mort_NET, .45, na.rm = TRUE) & Percent_2mort_NET <= quantile(Percent_2mort_NET, .55, na.rm = TRUE), TRUE, FALSE),
    
         middle_Percent_bach_2022 = ifelse(Percent_bach_2022 > quantile(Percent_bach_2022, .45, na.rm = TRUE) & Percent_bach_2022 <= quantile(Percent_bach_2022, .55, na.rm = TRUE), TRUE, FALSE),
    
         middle_Percent_bach_PCT = ifelse(Percent_bach_PCT > quantile(Percent_bach_PCT, .45, na.rm = TRUE) & Percent_bach_PCT <= quantile(Percent_bach_PCT, .55, na.rm = TRUE), TRUE, FALSE),
    
    #this one
         middle_Percent_Owner_2022 = ifelse(Percent_Owner_2022 > quantile(Percent_Owner_2022, .45, na.rm = TRUE) & Percent_Owner_2022 <= quantile(Percent_Owner_2022, .55, na.rm = TRUE), TRUE, FALSE),
    
         middle_Rent_Income_Ratio_2017 = ifelse(Rent_Income_Ratio_2017 > quantile(Rent_Income_Ratio_2017, .45, na.rm = TRUE) & Rent_Income_Ratio_2017 <= quantile(Rent_Income_Ratio_2017, .55, na.rm = TRUE), TRUE, FALSE),
    
    #Also testing this new binary variable!
    transit_under_1.5 = ifelse(nearest_transit <= 1.5, TRUE, FALSE)
                                 ) %>%
  
  dplyr::select(TOTAL, starts_with('middle'), transit_under_1.5) %>%
  st_drop_geometry() %>%
  gather(variable, value, -TOTAL)

census_permits_test %>%
  filter(TOTAL < 600) %>%
  group_by(variable, value) %>%
  summarise(mean_permits = mean(TOTAL)) %>%
  ungroup() %>%
    ggplot(aes(value,mean_permits)) + 
      geom_bar(position = "dodge", stat= "identity", fill = "#ffb600") + 
      facet_wrap(~variable, scales = "free") +
      labs(x="Total Permits", y="Value", 
           title = "Middle percentile (45-55th) associations with number of permits", caption = "Figure 9") +
      theme(legend.position = "none") 

# Mutating only for the variables variables that mattered

census_permits <- census_permits %>%
  mutate(middle_Percent_Owner_2022 = ifelse(Percent_Owner_2022 > quantile(Percent_Owner_2022, .45, na.rm = TRUE) & Percent_Owner_2022 <= quantile(Percent_Owner_2022, .55, na.rm = TRUE), TRUE, FALSE),
         
         middle_Percent_2mort_NET = ifelse(Percent_2mort_NET > quantile(Percent_2mort_NET, .45, na.rm = TRUE) & Percent_2mort_NET <= quantile(Percent_2mort_NET, .55, na.rm = TRUE), TRUE, FALSE),
         
          middle_rent_17 = ifelse(Med_Rent_2017 > quantile(Med_Rent_2017, .45, na.rm = TRUE) & Med_Rent_2017 <= quantile(Med_Rent_2017, .55, na.rm = TRUE), TRUE, FALSE),
         
         transit_under_1.5 = ifelse(nearest_transit <= 1.5, TRUE, FALSE))
```


```{r}
#Mapping cafes and vacancy:

# ggplot(census_permits) +
#   geom_sf(aes(fill = cafe_count)) +
#   labs(title = "Number of Cafes per Tract") +
#   facet_wrap(~year) +
#   mapTheme()
# 
# ggplot(census_permits %>% filter(year == 2023)) +
#   geom_sf(aes(fill = Percent_Vacant_2016)) +
#   labs(title = "Vacancy Rate 2016") +
#   mapTheme()

```

#### Correlation Matrix

We compared variables against each other in a correlation test. Many of the variables correlated with each other more than with the count of total permits. The most correlated variables with permit count are vacancy with a positive correlation, wherein more vacancy means higher permit issuance. After some additional analysis and feature engineering however, we realize that close proximity to transit (within 1.5 mile) is more significant than the nearest neighbor transit. Variables are of course correlated with each other, but overall census variables are not as correlated with permit count as expected. 

```{r}

numericVars <- census_permits %>%
  st_drop_geometry() %>%
  select(TOTAL, cafe_count, nearest_transit, job_count, Med_Inc_2022, Rent_Income_Ratio_NET,
         White_Pop_2022, Med_Value_NET, Vacants_2022, Vacants_2017,
         Vacants_PCT, Vacants_NET) 

price.corr <- corr.test(numericVars)

if (sum(is.na(numericVars)) > 0) {
  # Handle missing values (e.g., impute or remove)
  numericVars <- na.omit(numericVars)  # Remove rows with missing values
}

correlation_matrix <- round(cor(numericVars), 1)
p_values <- cor_pmat(numericVars)

ggcorrplot(
  correlation_matrix,
  p.mat = p_values,
  colors = c( "#f20089", "white", "#2d00f7"),
  type = "lower",
  insig = "blank"
  
) +
labs(title = "Degrees of Correlation to Permits Issued", caption = "Figure 10") +
  annotate(
  geom = "rect",
  xmin = .5, xmax = 11.5, ymin = .5, ymax = 1.5,
  fill = "transparent", color = "#ffb600", alpha = 0.5
  

)
```

## **MODELING**

For the baseline, we included census variables including median HH income (CPI inflation adjusted), Median home value, Median rent, Median year structure built, tenure, number of housing units, number of vacancies, count of second mortgages, percent with a Bachelor's degree, and percent white population. 

After doing our exploratory analysis, we narrowed down the variables for our model. For the spatial and temporal lag model, we used these variables, grouped into the following categories:

**TIME/SPACE** <br>
- name (neighborhood) <br>
- year <br>

**GENTRIFICATION INDICATORS** <br>
- indicator (incorporates bachelors degrees, income price changes, rent changes, and population) <br>

**DEMOGRAPHIC** 
(the strongest) <br>
- Bach_NET <br>
- middle_Percent_Owner_2022 <br>
- middle_Percent_2mort_NET <br>

(other demographic variables)
- Med_Value_NET <br>
- Percent_2mort_2022 <br>
- Percent_bach_NET <br>
- Rent_Income_Ratio_2022 <br>
- White_Pop_NET <br>

**AMENITIES** <br>
- cafe_count <br>
- job_count <br>
- transit_under_1.5 <br>

**TIME LAG** <br>
- last_year_TOTAL <br>
- Med_Value_2017 <br>
- Total_Units_2017 <br>
- Vacants_2017 <br>
- middle_rent_17 <br>

**SPATIAL LAG** <br>
- Local Moran's I <br>
- nn_LY_permits <br>
- nn_bach_22 <br>
- nn_income_17 <br>

Several iterations of models were developed over the course of this project, each attempting to isolate and coalesce variables in different combinations to create as accurate a prediction as possible.

We used three cross-validation methods: random 10-fold cross-validation, spatial cross-validation, and temporal cross-validation. By cross validating, we can detect generalizability & accuracy for the model's predictions across time and space in Philadelphia. Cross validation will also help determine whether our model is overfitting. With overtfitting, the model is very accurate but not generalizable to new datasets as new data becomes available and input into the model. 



```{r}
set.seed(152)

census_permits_dat <- census_permits %>%
  filter(year %in% c(2020, 2021, 2022)) %>%
  mutate(cvID = sample(round(nrow(.) / 115.5), 
                       size=nrow(.), replace = TRUE)) %>%
  filter(name != "BYBERRY" & name != "WISSAHICKON_PARK" & name != "GLENWOOD" & name != "WISTER"
         & name != "MCGUIRE")
```


```{r, results = 'hide', message = FALSE}

# BASELINE

reg.baseline.vars <- c("year","indicator","name","Percent_White_2022","Percent_Owner_2022", "Percent_Vacant_2022", "Med_Inc_2022","Med_Value_2022","Med_Rent_2022", "Med_Structure_2022",
                                    "Tot_Units_2022", "Vacants_2022", "Sec_Mort_2022", "Percent_bach_2022",
                                    "Rent_Income_Ratio_2022",
                       "Bach_NET", "middle_Percent_Owner_2022","middle_Percent_2mort_NET",
                       "Med_Value_NET","Percent_bach_NET",  "White_Pop_NET",
                      "cafe_count", "job_count", "transit_under_1.5")

reg.baseline.vars.space <- reg.baseline.vars[reg.baseline.vars != "name"]
reg.baseline.vars.time <- reg.baseline.vars[reg.baseline.vars != "year"]

# SPACE AND TIME - zero inflated

reg.st.vars <- c( "year", "indicator", "name","Bach_NET","middle_Percent_Owner_2022","middle_Percent_2mort_NET",
                                "Med_Value_NET", "Percent_2mort_2022", "Percent_bach_NET",
                                    "Rent_Income_Ratio_2022", "White_Pop_NET", "cafe_count",
                                    "transit_under_1.5",  "last_year_TOTAL",
                                    "Med_Value_2017", "Tot_Units_2017", "Vacants_2017",
                                    "middle_rent_17",  "Local_Morans_I",  "nn_LY_permits",
                                    "nn_bach_22", "nn_income_17", "jobs.nn")

reg.st.space <-reg.st.vars[reg.st.vars != "name"] 
reg.st.time <- reg.st.vars[reg.st.vars != "year"]

# Baseline
reg.baseline.random <- crossValidate(
  dataset = census_permits_dat,
  id = "cvID",                           
  dependentVariable = "TOTAL",
  indVariables = reg.baseline.vars) %>%
    dplyr::select(cvID = cvID, TOTAL, Prediction, name)

reg.baseline.cv.time <- crossValidate(
  dataset = census_permits_dat,
  id = "year",                           
  dependentVariable = "TOTAL",
  indVariables = reg.baseline.vars.time) %>%
    dplyr::select(cvID = year, TOTAL, Prediction, name)

reg.baseline.cv.space <- crossValidate(
  dataset = census_permits_dat,
  id = "name",                           
  dependentVariable = "TOTAL",
  indVariables = reg.baseline.vars.space) %>%
    dplyr::select(cvID = name, TOTAL, Prediction) %>%
  mutate(name = cvID) %>%
  dplyr::select(cvID, TOTAL, Prediction, name, geometry)

# space/time lag

reg.st.cv.random <- crossValidate(
  dataset = census_permits_dat,
  id = "cvID",                           
  dependentVariable = "TOTAL",
  indVariables = reg.st.vars) %>%
    dplyr::select(cvID = cvID, TOTAL, Prediction, name)

reg.st.cv.time <- crossValidate(
  dataset = census_permits_dat,
  id = "year",                           
  dependentVariable = "TOTAL",
  indVariables = reg.st.time) %>%
    dplyr::select(cvID = year, TOTAL, Prediction, name)

reg.st.cv.space <- crossValidate(
  dataset = census_permits_dat,
  id = "name",                           
  dependentVariable = "TOTAL",
  indVariables = reg.st.space) %>%
    dplyr::select(cvID = name, TOTAL, Prediction) %>%
  mutate(name = cvID) %>%
  dplyr::select(cvID, TOTAL, Prediction, name, geometry)

reg.summary <- 
  rbind(
    mutate(reg.baseline.random,           Error = Prediction - TOTAL,
                             Regression = "Random k-fold CV: Baseline Variables"),
    
    mutate(reg.st.cv.random,           Error = Prediction - TOTAL,
                             Regression = "Random k-fold CV: Spatial and Temporal Lag"),
    
    mutate(reg.baseline.cv.space,    Error = Prediction - TOTAL,
                             Regression = "Spatial LOGO-CV: Baseline Variables"),
    
    mutate(reg.st.cv.space,    Error = Prediction - TOTAL,
                             Regression = "Spatial LOGO-CV: Spatial and Temporal Lag"),
    
    mutate(reg.baseline.cv.time,    Error = Prediction - TOTAL,
                             Regression = "Temporal LOGO-CV: Baseline Variables"),
    
    mutate(reg.st.cv.time,    Error = Prediction - TOTAL,
                             Regression = "Temporal LOGO-CV: Spatial and Temporal Lag")
    
    ) %>%
    st_sf() 

```

### Explore Errors


Mean absolute error is quite low and similar across all cross validation methods with only a few outliers. These outliers have significant count differences, but the rest of the neighborhoods are overal the same. The spatial and temporal lag model when cross validated at first appears to be the strongest model with the lowest error, which suggests that time and space lag factors are the most influential predictors. 


```{r}
error_by_reg <- 
  reg.summary %>%
    group_by(Regression) %>% 
    summarize(Mean_Error = mean(Error, na.rm = T),
              MAE = mean(abs(Error), na.rm = T),
              SD_MAE = sd(abs(Error), na.rm = T)) 

error_by_reg_nh <- 
  reg.summary %>%
    group_by(Regression, name) %>% 
    summarize(Mean_Error = mean(Prediction - TOTAL, na.rm = T),
              MAE = mean(abs(Prediction - TOTAL), na.rm = T),
              SD_MAE = sd(abs(Prediction - TOTAL), na.rm = T)) %>%
  ungroup()

## plot histogram of errors
error_by_reg_nh %>%
  ggplot(aes(MAE)) + 
    geom_histogram(bins = 30, colour="black", fill = "#f20089") +
  facet_wrap(~Regression, ncol = 2) +
    labs(title="Distribution of MAE by Regression and Neighborhood",
         x="Mean Absolute Error", y="Count")


```

#### Spatial Distribution of Errors

When we examine the temporal and spatial lag in further detail, it becomes apparent that the neighborhoods with highest outlier in predicted count are near the Fishtown and Southwest Philly neighborhoods. Other than these, error is quite small throughout the rest of Philadelphia. 

```{r}

reg.summary %>%
  filter(Regression == "Temporal LOGO-CV: Spatial and Temporal Lag") %>%
  group_by(name) %>%
  st_drop_geometry() %>%
  summarize(residuals = mean(Error, na.rm = T)) %>%
  ungroup() %>% 
  left_join(phl.nh, by = c("name" = "name")) %>%
    st_sf() %>%
    ggplot() + 
      geom_sf(aes(fill = residuals)) +
      scale_fill_gradient2(low = "#8900f2" , mid =  "#ffb600", high = "#e500a4", name = "Error") +
      labs(title = "Spatial and Temporal Lag: \nMean Absolute Error by Neighborhood", caption = "Figure 10") +
      mapTheme()


```

Finally, when plotting predicted permits compared with observed permits, we can see that the observed and predicted permits are fairly close but not so much that the model is overfitting. It is therefore generalizable and overall accurate. 

```{r}

reg.summary %>%
  filter(Regression == "Temporal LOGO-CV: Spatial and Temporal Lag") %>%  
  ggplot(aes(TOTAL, Prediction)) +
  geom_point() +
  stat_smooth(aes(TOTAL, TOTAL), 
             method = "lm", se = FALSE, size = 1, colour="#ffb600") + 
  stat_smooth(aes(Prediction, TOTAL), 
              method = "lm", se = FALSE, size = 1, colour="#f20089") +
  labs(title="Predicted permits as a function of observed permits",
       subtitle="Orange line represents a perfect prediction; Pink line represents prediction", caption = "Figure 21") +
  xlab("Actual Permits") +
  ylab("Predicted Permits") +
  plotTheme()



```

Our spatial regression model cross validates on a neighborhood level across time, while our temporal regression cross validates year by year. Regression error for the temporal logo cross validation with spatial and temporal lag have the lowest mean absolute error as well as standard deviation, which remains consistent with our other findings of that model. It is clear that this model is generalizable, accurate, and overall a good fit across measurements. Because cross validating on a temporal scale appears to have the lowest mean MAE and Standard Deviation, we can be fairly certain that time is key to predicting permit issuance. This is also important since it means that additional years of data will ideally be able to predict permit issuance with future years of datasets. 


```{r}
st_drop_geometry(error_by_reg_nh) %>%
  group_by(Regression) %>% 
    summarize(Mean_MAE = round(mean(MAE), 2),
              SD_MAE = round(sd(MAE), 2)) %>%
  kable() %>%
    kable_styling("striped", full_width = F) %>%
    row_spec(3, color = "black", background = "#8900f2") %>%
    row_spec(4, color = "black", background = "#8900f2") %>%
    row_spec(5, color = "black", background = "#ffb600") %>%
    row_spec(6, color = "black", background = "#ffb600")



```

## **CONCLUSION**

### Discussion

Overall, this model is effective and when coupled with indicators of gentrification, can help predict which neighborhoods are most likely areas with high amounts of development and experiencing intense neighborhood change. To prevent displacement for long-term and socioeconomically disadvantaged residents, community land trusts and housing trust funds can fund long-term solutions. Since it is time-consuming to develop, **fund**, and implement these types of solutions, anticipating a few years out can prevent displacement and create long-term affordable housing. Targeting areas that are experiencing early indicators of gentrification but not high amounts of permit development is a critical strategy. 

### Use Case Application


Identifying aras for community land trust investment can help as a long term strategy for anti-displacement, like zoning. [Community land trusts](https://groundedsolutions.org/strengthening-neighborhoods/community-land-trusts) will be influential in preventing Philadelphia from pushing out its long-term residents. 
Identifying areas for community land trust investment can help as a long term strategy to fight displacement. We hope that this can help secure affordable housing and predict the best areas to secure it for the future. 

For a brief presentation of this markdown, please visit our [video here]()
